Query-based information retrieval refers to the process of scoring documents given a
short natural language query. Query-based information retrieval systems have been developed
to support searching diverse collections such as the world wide web, personal email
archives, news corpora, and legal collections. This thesis is motivated by one of the tenets
of information retrieval: the cluster hypothesis. We define a design principle based on the
cluster hypothesis which states that retrieval scores should be locally consistent. We refer to
this design principle as score autocorrelation. Our experiments show that the degree to which
retrieval scores satisfy this design principle correlates positively with system performance.
We use this result to define a general, black box method for improving the local consistency
of a set of retrieval scores. We refer to this process as local score regularization. We demonstrate
that regularization consistently and significantly improves retrieval performance for
a wide variety of baseline algorithms. Regularization is closely related to classic techniques
such as pseudo-relevance feedback and cluster-based retrieval. We demonstrate that the
effectiveness of these techniques may be explained by their regularizing behavior. We argue
that regularization should be adopted either as a generic post-processing step or as a
fundamental design principle for retrieval models.

other collections tend to be surrounded by fewer relevant documents

3.7 Artificial 2-dimensional data produced to represent a single relevant cluster (red
points) in the midst of many non-relevant data (black points). The top row shows the
relevant cluster developing as the number of relevant points grows from 1 to 25. The
second row shows the distributions of similarities between relevant documents (RR) and
relevant and non-relevant documents (NR). The third row shows the distribution of
local precision. Relevant points are sampled from a Gaussian; non-relevant points are
sampled uniformly.

3.8 Four relevant clusters of varying sizes. The top row shows the relevant clusters
developing as the number of relevant points per relevant cluster grows from 1 to 25. The
second row shows the distributions of similarities between relevant documents (RR) and
relevant and non-relevant documents (NR). The third row shows the distribution of
local precision. Relevant points are sampled from a Gaussian; non-relevant points are
sampled uniformly.

3.9 Two relevant clusters of non-uniform density. The first subfigure shows fifty
relevant points and 150 non-relevant points in two dimensions. The second subfigure
shows the distributions of similarities between relevant documents (RR) and relevant
and non-relevant documents (NR). The third subfigure shows the distribution of local
precision.

3.10 Microaveraged values of the Jardine-van Rijsbergen and Voorhees tests for
baseline retrievals using the trec12 and robust collections. The

4.1 Retrieval functions on the document graph. We constructed a nearest-neighbor
document graph for the top 1000 documents from a retrieval. Edges were colored by a
gradient based on the relevance of each connected document. High retrieval scores are
associated with red. Low retrieval scores are associated with grey.

5.1 Functions in one dimension. Each value on the horizontal axis may, for
example, represent a one-dimensional classification code such as a

set of functions are intended to describe the same phenomenon or signal, we can
develop criteria for preferring one function over another. If we prefer smoother
functions, we would dismiss the function in (a) in favor of the function in (b). The
process of smoothing the function in (a) into the function in (b) is a type of
regularization.

5.2 Regularizing retrieval scores. Documents in a collection can often be embedded
in a vector space as shown in (a). When presented with a query, a retrieval system
provides scores for all of the documents in the collection (b). Score regularization
refers to the process of smoothing out the retrieval function such that neighboring
documents receive similar scores (c).

5.3 Smoothness and error constraints for a function on a linear graph. In Figure (a),
the smoothness constraint penalizes functions where neighboring nodes in f receive
different values. In Figure (b), the error constraint penalizes functions where nodes
in f receive values different from the corresponding values in y.

5.4 Local Score Regularization Algorithm. Inputs are n, y, k, and α. The output is a
length-n vector of regularized scores, f*.

5.5 Unregularized and regularized scores for the query “U.S. Restaurants in
Foreign Lands”.

5.6 Precision-recall curves for regularized trec12 scores. Mean average precision
shown in parentheses.

5.7 Precision-recall curves for regularized robust scores. Mean average precision
shown in parentheses.

detailed experiments

5.9 Performance improvement as a function of Laplacian type. For each Laplacian
described in Section 5.2.1, we maximized mean average precision using 10-fold
cross-validation (left: combinatorial Laplacian, center: normalized Laplacian, right:
approximate Laplace-Beltrami). The different Laplacians represent different degree
normalization techniques.

5.10 Performance as a function of amount of regularization. For each value of α, we
selected the values for k and t maximizing mean average precision. A higher value for α
results in more aggressive

5.11 Performance as a function of number of neighbors. For each value of k, we
selected the values for α and t maximizing mean average precision. If we trust the
distance metric, we would expect the performance to increase as we increase the number
of neighbors.

5.12 Improvement in mean average precision for TREC query-based retrieval tracks.
Each point represents a competing run. The horizontal axis indicates the original mean
average precision for this run. The vertical axis indicates the mean average precision
of the regularization run. Red points indicate an improvement; blue points indicate

5.13 Improvement in mean average precision for TREC query-based retrieval

CHAPTER 1


INTRODUCTION 


In information retrieval, we develop systems to help a searcher locate relevant data hidden
in some large set of retrievable items. Searchers will have diverse backgrounds and
needs.  A  searcher  may  know  a  lot  about  the  relevant  data  or  very  little;  a  searcher  may 
demand  a  few  relevant  items  or  an  exhaustive  list  of  relevant  items.  Unfortunately,  the  size 
of  the  relevant  data  set  is  much,  much  smaller  than  the  number  of  retrievable  items.  In  fact, 
the  relevant  data  set  may  consist  of  a  single  document,  a  small  set  of  sentences,  or  even  a 
one-word answer. If the collection is not text, the set of retrievable items may contain artifacts
such as images or movies. Complicating matters, information retrieval systems need
to  perform  the  classification  of  documents  into  relevant  and  non-relevant  sets  for  many, 
arbitrary  queries. 

The study of automatic information retrieval, beginning in the 1950s, has had a very long
history in computer science (even longer if we include study outside of computer science).
Over  this  period,  scientists  have  developed  a  set  of  design  principles  for  information  retrieval 
systems.  When  presented  with  a  new  retrieval  scenario,  we  design  systems  or  models  with 
these  principles  in  mind.  Several  classic  design  principles  include  preferring  documents 
with  multiple  query  term  matches  to  those  with  fewer,  weighting  terms  according  to  their 
inverse  document  frequency,  rewarding  documents  with  query  terms  in  close  proximity,  and 
favoring popular documents. We can also view these principles as heuristics which systems
or models must satisfy in order to perform well [Fang et al., 2004].

One principle which lies at the foundation of information retrieval is the idea of document
clustering. Originally proposed by Jardine and van Rijsbergen in 1971, the cluster
hypothesis can be stated as [van Rijsbergen, 1979]:

Closely  associated  documents  tend  to  be  relevant  to  the  same  requests. 

Jardine and van Rijsbergen were interested in measuring the degree to which, for a given
request or query, associated documents tended to have the same relevance to the searcher.
In  this  thesis,  we  study  topic-based  information  retrieval.  Therefore,  when  we  refer  to  a 
document  as  relevant,  we  mean  that  the  document  satisfies  the  query’s  topical  requirement. 
This  definition  is  commonly  used  in  the  information  retrieval  community,  most  visibly  in 
the  TREC  conferences.  Similarly,  when  we  refer  to  the  associations  between  documents, 
we  mean  the  topical  associations  between  documents.  Topical  association  can  be  measured 
in different ways. For example, Jardine and van Rijsbergen detect topical associations
using metrics such as term overlap statistics and the cosine correlation.

The cluster hypothesis is one of the tenets of information retrieval: it is often cited and
has motivated numerous algorithms. Examples include clustering the corpus and retrieving a
single cluster, locally interpolating document representations, and reducing the dimensionality
of the entire semantic space. Unfortunately, the algorithms motivated by the cluster
hypothesis are quite varied and it is often difficult to measure how exactly the design principle
is being incorporated into a system.

In  the  first  part  of  this  thesis,  we  will  focus  on  testing  the  cluster  hypothesis.  Consider  the 
small  set  of  documents  in  Figure  1.1  and  their  relationship  to  the  query  “dog”.  We  observe 
that  document  (a)  is  clearly  relevant  because  it  discusses  “dogs”  explicitly  and  frequently; 
document  (b)  is  much  less  relevant  because  it  mostly  discusses  cats  and  only  refers  to  “dogs” 
in  passing;  documents  (c)  and  (d)  are  also  relevant  even  though  they  use  the  scientific  term 
“canine”  instead  of  “dog”.  There  is  evidence  for  the  cluster  hypothesis  if,  given  the  query 
“dog”,  the  relevance  of  document  (a)  implies  the  relevance  of  documents  (c)  and  (d).  We 
will  develop  a  test  to  measure  the  extent  to  which,  for  a  scoring  of  documents,  the  cluster 
hypothesis is satisfied. This measure, which we refer to as score autocorrelation, detects the
degree to which a system scores documents consistently. Assume that, given the query “dog”,
we request that a retrieval system score the documents in Figure 1.1. A retrieval based solely
on term frequencies may produce a set of scores [3, 1, 0, 0] for documents (a), (b), (c), and
(d). This retrieval would receive a low score autocorrelation because the scores of related
documents (a), (c), and (d) are very different. A second retrieval system may produce the
set of scores [3, 0, 2.5, 2.5]. This retrieval would receive a higher autocorrelation because
scores  of  related  documents  are  more  consistent. 

We  adopt  an  autocorrelation  measure  from  spatial  data  analysis  referred  to  as  the  Moran 
autocorrelation.  We  will  show  that  the  autocorrelation  of  a  set  of  retrieval  scores  in  an  affinity 
space  induced  by  a  similarity  measure  accurately  captures  the  behavior  in  our  example. 
With  this  consistency  measure  in  hand,  we  will  conduct  a  series  of  descriptive  experiments 
measuring  the  correlation  between  consistency  and  system  performance.  The  experiments 
in the first part of this thesis demonstrate the following two results:

1. The local consistency of a retrieval correlates positively with performance.

2. Many retrieval models fail to produce autocorrelated scores.

We  demonstrate  these  results  for  a  large  set  of  baseline  retrieval  models  over  a  diverse  set  of 
retrieval  scenarios. 
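
The "dog" example can be made concrete with a small sketch of the Moran autocorrelation. The affinity matrix below is a hypothetical stand-in (documents (a), (c), and (d) are treated as mutually related and (b) as unrelated); it is not the similarity measure used in the experiments, and the function uses the textbook form of Moran's I rather than any thesis-specific variant.

```python
def moran_i(scores, weights):
    """Textbook Moran's I of `scores` under an affinity matrix `weights`."""
    n = len(scores)
    mean = sum(scores) / n
    dev = [s - mean for s in scores]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between related documents.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance)


# Hypothetical affinities, in the order (a), (b), (c), (d).
AFFINITY = [
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]

term_frequency_scores = [3, 1, 0, 0]
consistent_scores = [3, 0, 2.5, 2.5]

print(moran_i(term_frequency_scores, AFFINITY))  # about -0.667
print(moran_i(consistent_scores, AFFINITY))      # about 0.303
```

Under these assumed affinities the term-frequency scores are negatively autocorrelated, while the second set of scores is positively autocorrelated, which matches the ordering described in the example.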

In the second part of this thesis, we propose a method for improving the effectiveness
of systems which fail to produce autocorrelated scores by improving local score consistency.
Beginning with the scores from some baseline retrieval method, we develop an optimization
problem to find a new set of scores which maximizes the local consistency. We refer to the
process of finding more consistent scores as local score regularization. The intuition behind our
solution is simple. Recall our documents in Figure 1.1. If the initial retrieval produces the
scores [3, 1, 0, 0], we will search the space of all score vectors for one which improves the
consistency between documents (a), (c), and (d). The output of our system will be this score
vector.
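
To illustrate the intuition, the sketch below smooths the scores [3, 1, 0, 0] over a hypothetical affinity graph. It uses a generic iteration that repeatedly averages each score with its degree-normalized neighbors while pulling it back toward the initial score; this is an assumed stand-in for illustration, not the algorithm developed later in the thesis. The affinity matrix, `alpha`, and the iteration count are all illustrative choices.

```python
import math


def regularize(initial_scores, weights, alpha=0.5, iterations=100):
    """Generic graph smoothing: f <- alpha * S f + (1 - alpha) * y."""
    n = len(initial_scores)
    degree = [sum(row) for row in weights]
    # Isolated documents (degree 0) receive no neighbor contribution.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    scores = list(initial_scores)
    for _ in range(iterations):
        scores = [
            alpha * sum(inv_sqrt[i] * weights[i][j] * inv_sqrt[j] * scores[j]
                        for j in range(n))
            + (1 - alpha) * initial_scores[i]
            for i in range(n)
        ]
    return scores


# Hypothetical affinities, in the order (a), (b), (c), (d).
AFFINITY = [
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]

regularized = regularize([3, 1, 0, 0], AFFINITY)
print(regularized)  # approximately [1.8, 0.5, 0.6, 0.6]
```

With these assumptions, documents (c) and (d) inherit part of the score of document (a) and move above document (b) in the ranking.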

We adopt a regularization method based on the graph Laplacian. We will show that using
the Laplacian directly models the introduction of local consistency into a set of retrieval

The dog (Canis lupus familiaris) is
a  domestic  subspecies  of  the  wolf, 
a  mammal  of  the  Canidae  family 
of  the  order  Carnivora.  The  term 
encompasses  both  feral  and  pet 
varieties  and  is  also  sometimes 
used  to  describe  wild  canids  of 
other  subspecies  or  species.  The 
domestic  dog  has  been  (and 
continues  to  be)  one  of  the  most 
widely-kept  working  and 
companion  animals  in  human 
history,  as  well  as  being  a  food 
source  in  some  cultures.  The  dog 
is  also  the  first  animal  from  Earth 
to  enter  into  space  and  fly  into 
orbit. 


Molecular  systematics  indicate 
that the domestic canine (Canis
lupus familiaris) descends from
one or more populations of wild
wolves (Canis lupus). As reflected
in  the  nomenclature,  canines  are 
descended  from  the  wolf  and  are 
able  to  interbreed  with  wolves. 


The Cat (Felis silvestris catus), also
known  as  the  Domestic  Cat  or 
House  Cat  to  distinguish  it  from 
other  felines,  is  a  small 
carnivorous  species  of  mammal 
that  is  often  valued  by  humans  for 
its  companionship  and  its  ability 
to  hunt  vermin.  It  has  been 
associated  with  humans  for  at 
least  9,500  years. 

Cats,  like  dogs,  are  digitigrades: 
they  walk  directly  on  their  toes, 
the  bones  of  their  feet  making  up 
the  lower  part  of  the  visible  leg. 


(b) 


(d) 


Figure 1.1. Four documents related to the query “dog”. All content appropriated from
Wikipedia.

scores. With this algorithm in hand, we will conduct a series of effectiveness experiments
measuring the improvement in performance garnered by the introduction of local consistency.
The experiments in the second part of this thesis demonstrate the following two
results:

1. Improving the local consistency of a system improves performance.

2. Many performance-improving retrieval methods can be seen as indirectly improving
consistency.

Again, we perform our experiments across a diverse set of tasks, demonstrating clear benefits
to applying regularization. We will describe in detail our performance improvements
and the tasks for which regularization is appropriate. Specifically, we will argue that regularization
is a method well-suited for high-recall tasks which require inspecting deep into
the ranked list. Because one of our fundamental data structures is the similarity matrix between
documents, we will analyze the numerical stability of regularization as a function of
changes to this similarity matrix. Furthermore, the relationship of regularization to the cluster
hypothesis allows us to directly analyze classic information retrieval approaches from the
perspective of regularization and allows us to develop new regularization-based methods for
novel tasks.

Although  many  of  the  approaches  we  use  in  this  thesis  are  new  to  information  retrieval, 
the arguments and motivations are classic. This thesis makes the following contributions to
information retrieval:

1. A precision prediction method directly derived from the Voorhees cluster
hypothesis test. We develop a new precision prediction method directly related to
the Voorhees cluster hypothesis test. This measure is attractive because of its relationship
to established tests in spatial data analysis.

2. A large-scale analysis of the local score consistency in retrieval systems.
We measure the amount of local score consistency for a large population of retrieval
submissions to various TREC competitions. The results indicate that many retrieval
systems do not consider local score consistency.

3. A consistently beneficial document re-ranking algorithm. We describe a new
method based on the graph Laplacian for re-ranking documents based on improving
local score consistency. We demonstrate that this algorithm is generally applicable
and easily extended to new domains.

4. A regularization-based perspective on pseudo-relevance feedback. We present
an extended discussion of the relationship between regularization and previous research,
concluding that some of the success of these methods may be explained by
their effect on local score consistency.

The remainder of this dissertation proceeds as follows:

Chapter 2: Preliminaries We define the information retrieval task and review several
classic and modern algorithms for retrieval. In addition, we survey text similarity
measures used in information retrieval.

Chapter 3: The Cluster Hypothesis in Information Retrieval We describe
the Jardine-van Rijsbergen and Voorhees tests of the cluster hypothesis. We present
theoretical and experimental arguments for using the Voorhees measure.

Chapter 4: Autocorrelation of Retrieval Scores Starting from the Voorhees
test, we develop score autocorrelation as a measure of local score consistency. We test
the correlation between local consistency and performance for a large set of retrieval
methods.

Chapter 5: Regularization of Retrieval Scores We describe an algorithm for
improving the local score consistency of arbitrary retrieval methods. We provide experimental
evidence demonstrating the effectiveness of regularization.

Chapter 6: Relationship to Other Retrieval Methods We present a comprehensive
and detailed comparison of regularization to previous retrieval methods.

Chapter 7: Stability of Regularization We analyze the numerical stability of regularization
subject to different similarity measures.

Chapter 8: Extensions and Future Work We describe extensions of regularization
for relevance feedback, cross-lingual retrieval, optimal set retrieval, and cross-media
retrieval.

CHAPTER 2


PRELIMINARIES 


In  this  chapter,  we  will  review  several  retrieval  models  referenced  in  this  thesis.  This 
survey  is  by  no  means  exhaustive  with  respect  to  models  or  techniques  within  each  model. 
Instead,  we  focus  on  models  representative  of  classic  and  state  of  the  art  approaches.  We 
will  define  models  in  their  standard  notation  as  well  as  matrix  notation.  Matrix  and  vector 
conventions are presented in Table B.1. This chapter provides a background for algorithms
and  techniques  used  later  in  this  thesis. 

2.1  The  Document  Collection 

A collection is a set of n documents which exist in an m-dimensional vector space where
m is the size of the vocabulary and elements of the vectors represent the frequency of the
term in the document. The set of documents in the collection will often be indicated as
$\mathcal{D} = \{i \mid 1 \le i \le n\}$. We define for each document $i \in \mathcal{D}$ a column vector, $d_i$, where each
element of the vector represents the frequency of the term in document i; we refer to this as
the document vector. Transposing and stacking up the n document vectors defines the n × m
collection matrix C.
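
As a minimal sketch of this construction, the code below builds an n × m matrix of raw term frequencies from a toy collection. The whitespace tokenization, the alphabetical vocabulary ordering, and the three example documents are assumptions made purely for illustration.

```python
def build_collection_matrix(documents):
    """Return (vocabulary, C) where C[i][j] counts term j in document i."""
    vocabulary = sorted({term for doc in documents for term in doc.split()})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for term in doc.split():
            matrix[i][column[term]] += 1
    return vocabulary, matrix


# Toy collection: n = 3 documents over a vocabulary of m = 4 terms.
docs = ["dog dog canine", "cat dog", "canine wolf"]
vocab, C = build_collection_matrix(docs)
print(vocab)  # ['canine', 'cat', 'dog', 'wolf']
print(C)      # [[1, 0, 2, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
```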

We define other symbols in Table B.2. Elaborations of definitions will occur when notation
is introduced.

2.2  Retrieval  Scores 

A set retrieval model assigns a binary prediction of relevance to each document in the collection.
The user then scans those documents predicted to be relevant. We can see this as
a mapping or function from documents in the collection to a binary value. Mathematically,
given a query, q, a set retrieval model provides a function, $f_q : \mathcal{D} \to \{0, 1\}$, from documents
to labels; we refer to $f_q$ as the initial score function or initial retrieval for a particular query.
The  argument  of  this  function  is  the  retrieval  system’s  representation  of  a  document.  The 
values  of  the  function  provide  the  system’s  labeling  of  the  documents.  Notice  that  we  index 
functions  by  the  query.  We  note  this  to  emphasize  the  fact  that,  in  information  retrieval,  the 
score  function  over  all  documents  will  be  different  for  each  query.  Although  we  drop  the 
index  for  notational  convenience,  this  function  is  always  associated  with  a  particular  query. 

A  ranked  retrieval  model  assigns  some  rank  or  score  to  each  document  in  the  collection  and 
ranks  documents  according  to  the  score.  The  user  then  scans  the  documents  according  to 
the ranking. The score function for a ranked retrieval model maps documents to real values.
Given a query, q, the model provides a function, $f_q : \mathcal{D} \to \mathbb{R}$, from documents to scores.
The values of the function provide the desired ranking of the documents.

In this section, we will review several classic and state-of-the-art ranked retrieval models.
Each retrieval model provides a method for defining a function $f_q : \mathcal{D} \to \mathbb{R}$. Since
each function can be treated as a set of scores assigned to an indexed set of documents, we
represent the score function using the vector $y \in \mathbb{R}^n$.

2.2.1  Vector  Space  Model 

The  vector  space  model  is  one  of  the  most  general  information  retrieval  models  [Salton, 
1968]. By treating a query as a very short document, documents and queries can be represented
in a shared, m-dimensional space and scores can be computed using the cosine
similarity  measure. 

In Section 2.1, we described document vectors as consisting of raw term frequencies. In
practice, the elements of these vectors (and therefore C) are adjusted to weight terms according
to their relative importance in the document and discriminativeness in the collection.
These are referred to as the term frequency or tf weight and inverse document frequency or
idf weight, respectively. Using the BM25 weights [Robertson and Walker, 1994], documents
are represented as,

$$\tilde{d}_i = \underbrace{\frac{d_i (k + 1)}{d_i + k \left( (1 - b) + b \left( \frac{\|d\|_1}{\|l\|_1 / n} \right) \right)}}_{\text{Okapi term frequency}} \times \underbrace{\log \frac{(n + 0.5) - c_i}{0.5 + c_i}}_{\text{inverse document frequency}} \qquad (2.1)$$

where d is a length-m document vector whose elements contain the raw term frequencies,
l is the length-n vector of document lengths, $l_i = \|d_i\|_1$, and c is the length-m
document frequency vector.
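
The BM25 weighting can be sketched for a single term as follows. The parameter values k = 1.2 and b = 0.75 are commonly used defaults and are assumptions here, not values taken from this thesis; the function and argument names are likewise illustrative. Document length is normalized by the average document length, the usual BM25 convention.

```python
import math


def bm25_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k=1.2, b=0.75):
    """BM25 weight of one term in one document (tf part times idf part)."""
    if tf == 0:
        return 0.0
    # Length normalization: longer-than-average documents are penalized.
    length_norm = (1 - b) + b * (doc_len / avg_doc_len)
    tf_part = tf * (k + 1) / (tf + k * length_norm)
    # Note: the idf part becomes negative when the term occurs in more
    # than half of the documents.
    idf_part = math.log((n_docs + 0.5 - doc_freq) / (0.5 + doc_freq))
    return tf_part * idf_part


rare = bm25_weight(tf=2, doc_len=10, avg_doc_len=10, n_docs=100, doc_freq=5)
common = bm25_weight(tf=2, doc_len=10, avg_doc_len=10, n_docs=100, doc_freq=20)
print(rare > common)  # True: rarer terms receive a larger weight
```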

The cosine similarity between the query and document can be computed as,

$$\cos(q, d) = \frac{q^{\mathsf{T}} d}{\|q\|_2 \times \|d\|_2} \qquad (2.2)$$

which is equivalent to the inner product between L2-normalized vectors. When discussing
the vector space model, we will assume that the rows of C are reweighted according to Equation
2.1 and L2-normalized. We will also assume that the query vector, q, is L2-normalized.
Using this notation, the scores for all documents in the collection can be represented as,

$$y = Cq \qquad (2.3)$$
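
A minimal sketch of this scoring step is shown below. For brevity it uses raw term frequencies rather than reweighted vectors, and the toy collection and query are assumptions for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def score_collection(collection, query):
    """Cosine score of every row of the collection against the query."""
    q = l2_normalize(query)
    return [sum(c * w for c, w in zip(l2_normalize(row), q))
            for row in collection]


# Toy collection over the vocabulary ['canine', 'cat', 'dog', 'wolf'].
C = [[1, 0, 2, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
query = [0, 0, 1, 0]  # the single-term query "dog"
y = score_collection(C, query)
print(y)  # roughly [0.894, 0.707, 0.0]
```

Note that the third document, which mentions only "canine" and "wolf", receives a score of zero even though it is topically related to the first.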

Pseudo-relevance feedback or query expansion refers to the technique of using information from
the top r documents retrieved by the original query. The system then performs a second
retrieval using a combination of this information and the original query. One way to incorporate
this information is to assume that the top r documents are relevant [Croft and Harper,
1979]. If the top r documents are assumed to be relevant, we can use the classic Rocchio
technique for incorporating additional terms [Rocchio, 1971]. Let the pseudo-relevant set be
R and r = |R|. In Rocchio feedback, we linearly combine the vectors of documents in R
with the original query vector, q. The modified query, $\tilde{q}$, is defined as,

$$\tilde{q} = q + \frac{\alpha}{r} \sum_{i \in R} d_i \qquad (2.4)$$

where α is the weight placed on the information from the pseudo-relevant documents. We
can use this new representation to score documents by their similarity to $\tilde{q}$,

$$y = C\tilde{q} \qquad (2.5)$$
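
The feedback step can be sketched as below. The sketch assumes the feedback term is averaged over the pseudo-relevant set (weight α divided by the number of feedback documents); the toy collection, the initial scores, and the values of `r` and `alpha` are illustrative assumptions.

```python
def rocchio_expand(query, collection, scores, r=2, alpha=0.5):
    """Add the alpha-weighted mean of the top-r document vectors to the query."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:r]
    expanded = list(query)
    for i in top:
        for j, weight in enumerate(collection[i]):
            expanded[j] += (alpha / len(top)) * weight
    return expanded


# Toy collection over the vocabulary ['canine', 'cat', 'dog', 'wolf'].
C = [[1, 0, 2, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
query = [0, 0, 1, 0]
initial_scores = [0.894, 0.707, 0.0]

expanded_query = rocchio_expand(query, C, initial_scores)
print(expanded_query)  # [0.25, 0.25, 1.75, 0.0]
```

The expanded query now places weight on "canine" and "cat", so a second retrieval can reach documents that never mention "dog".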


2.2.2  Language  Model  Scores 

In  the  language  modeling  approach  to  information  retrieval,  terms  found  in  a  document 
represent samples from some underlying m-dimensional multinomial over terms in the vocabulary
[Croft and Lafferty, 2003]. Each document in the collection is associated with a
unique multinomial which is referred to as the document language model. A document language
model can be estimated from the terms occurring in the text. A user’s query is treated as
an unordered bag of words. Then, given a query, documents can be ranked by their probability
of having generated the query sequence. The intuition behind this ranking is that
documents  which  are  more  likely  to  have  generated  the  query  are  more  likely  to  be  relevant. 
This  ranking  method  is  referred  to  as  query  likelihood  retrieval. 

There are many ways to estimate a document language model. Let d contain the raw
term frequencies for a document and $P(w \mid \theta_d)$ be the estimate of the document language
model. The maximum likelihood estimate of $P(w \mid \theta_d)$ is defined as,

$$P(w_i \mid \theta_d) = \frac{d_i}{\|d\|_1} \qquad (2.6)$$

When estimating a distribution, especially with a small sample, it is statistically attractive
to reserve some probability mass for unseen events. For example, given a document about
dogs, even if we never saw the word “canine” or “cat”, we would like to think that, if the
author continued writing, these words would occur with some probability. The assignment
of non-zero weights to unseen terms is referred to as smoothing. One popular and effective
method for smoothing document models is to use the conjugate prior of the distribution; for
the multinomial, this would be the Dirichlet distribution. Using Dirichlet smoothing, we
estimate the document language model as,

$$\tilde{d}_i = \frac{d_i + \mu P(w_i \mid \theta_C)}{\|d\|_1 + \mu} \qquad (2.7)$$

where μ is the Dirichlet smoothing parameter and $P(w \mid \theta_C)$ is the maximum likelihood
collection model defined as,

$$P(w_i \mid \theta_C) = \frac{\sum_j C_{ji}}{\sum_j \sum_k C_{jk}} \qquad (2.8)$$
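
A short sketch of Dirichlet smoothing follows. The toy count matrix and the value μ = 2 are assumptions chosen so the arithmetic is easy to follow; practical values of μ are typically far larger.

```python
def collection_model(collection):
    """Maximum likelihood term distribution over the whole collection."""
    total = sum(sum(row) for row in collection)
    n_terms = len(collection[0])
    return [sum(row[j] for row in collection) / total for j in range(n_terms)]


def dirichlet_smooth(doc, background, mu):
    """Smoothed document language model from raw term counts."""
    length = sum(doc)
    return [(tf + mu * p) / (length + mu) for tf, p in zip(doc, background)]


# Raw counts over the vocabulary ['canine', 'cat', 'dog', 'wolf'].
C = [[1, 0, 2, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
background = collection_model(C)
smoothed = dirichlet_smooth(C[0], background, mu=2.0)
print(smoothed)  # 'cat' and 'wolf' now have non-zero probability
```

The smoothed vector is still a probability distribution: the mass moved to unseen terms is taken proportionally from the observed ones.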

A  comparison  of  alternative  smoothing  methods  for  information  retrieval  can  be  found  in 
[Zhai  and  Lafferty,  2004] . 
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Given  a  query,  we  rank  documents  according  to  their  query  likelihood.  We  can  write 
the  query  as  a  length-m  vector  where  elements  contain  the  frequency  of  each  term  in  the 
query  Using  the  common  assumption  that  query  terms  are  independently  sampled  from 
the  underlying  model,  a  document’s  score  can  be  written  as, 

m 

P(Q\0d)  =  Y[P(wi\9d)qi 

i= 1 
m 

r^t^gP(Wi\9d) 

i=  1 

0(1  t^ogP(wM 

*= i 

m 

=  Y  P(wi\9q)  \ogP(wi\Qd)  (2.9) 

i= 1 

In the second line of this derivation, we take the logarithm of the product, which preserves
the rank ordering of documents and results in a linear scoring function. In the third line,
we multiply by the reciprocal of the query length, which also preserves the ordering of the
documents. Finally, we recognize the maximum likelihood query model, $P(w \mid \theta_q)$, in the
formula using Equation 2.6. Note here that the score is actually the cross entropy between
the document model, $P(w \mid \theta_d)$, and the query model, $P(w \mid \theta_q)$. We can write this function
in vector notation as,

$$P(Q \mid \theta_d) \stackrel{\text{rank}}{=} [\log d]^{T} q \tag{2.10}$$

where q and d are language models. When discussing the language model approaches, we
will assume that the rows of C are smoothed according to Equation 2.7. We will also assume
that the query vector, q, is $L_1$ normalized. Using this notation, the scores for all documents
in the collection can be represented as,

$$y = (\log C) q \tag{2.11}$$
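Equation 2.11 can be sketched in a few lines of Python. This is an illustration only: `query_likelihood_scores` is a hypothetical helper, the rows of `doc_models` are assumed to be smoothed (so no probability is zero), and `query_model` is assumed to be L1 normalized.

```python
import math


def query_likelihood_scores(doc_models, query_model):
    """Compute y = (log C) q, one rank-equivalent score per document.

    doc_models: rows are smoothed document language models (all entries > 0).
    query_model: L1-normalized query vector q.
    """
    return [sum(q_i * math.log(p_i)
                for p_i, q_i in zip(row, query_model) if q_i > 0)
            for row in doc_models]


doc_models = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
scores = query_likelihood_scores(doc_models, [1.0, 0.0, 0.0])
```

Terms with zero query weight are skipped, so only query terms contribute to a document's score.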

In the language modeling framework, pseudo-relevance feedback can be defined in several
ways. We focus on the “relevance model” technique [Lavrenko, 2004]. In relevance
modeling, the original scores are used as weights for the estimated relevance model. The
relevance model, $P(w \mid \theta_r)$, is formally constructed by interpolating the maximum likelihood
query model, $P(w \mid \theta_q)$, and relevance-weighted document models, $P(w \mid \theta_d)$,

$$P(w \mid \theta_r) = \lambda P(w \mid \theta_q) + (1 - \lambda) \left( \sum_{d \in R} \frac{P(Q \mid \theta_d)}{Z} P(w \mid \theta_d) \right) \tag{2.12}$$

where $Z = \sum_{d \in R} P(Q \mid \theta_d)$, which means we are using an $L_1$ normalized version of y. This
is clearer if we represent $P(w \mid \theta_r)$ in matrix notation,

$$\tilde{q} = \lambda q + \frac{1 - \lambda}{\|y\|_1} C^{T} y \tag{2.13}$$

Cross entropy scoring can be used because $\tilde{q}$ is a language model. The document scores
after pseudo-relevance feedback are,

$$\tilde{y} = (\log C) \tilde{q} \tag{2.14}$$
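The feedback step of Equations 2.12 through 2.14 might be sketched as follows. The helper name, the choice of feedback set, and the value of `lam` are assumptions made for illustration; the weights are the query likelihoods normalized over the feedback set, computed in log space for numerical stability.

```python
import math


def relevance_model_rescore(doc_models, query_counts, feedback_ids, lam=0.5):
    """Return the expanded query model and the rescored documents.

    doc_models: rows are smoothed document language models.
    query_counts: raw query term frequencies (length m).
    feedback_ids: distinct indexes of the feedback set R (assumed to be given,
        for example the top-ranked documents of an initial retrieval).
    lam: interpolation weight; 0.5 is an arbitrary illustrative value.
    """
    query_length = sum(query_counts)
    q = [c / query_length for c in query_counts]  # maximum likelihood query model

    def log_likelihood(model):
        return sum(c * math.log(p) for p, c in zip(model, query_counts) if c > 0)

    # Weights P(Q | theta_d) / Z over the feedback set.
    logs = {d: log_likelihood(doc_models[d]) for d in feedback_ids}
    shift = max(logs.values())
    weights = {d: math.exp(v - shift) for d, v in logs.items()}
    z = sum(weights.values())

    q_tilde = []
    for i in range(len(q)):
        expansion = sum(weights[d] / z * doc_models[d][i] for d in feedback_ids)
        q_tilde.append(lam * q[i] + (1.0 - lam) * expansion)

    # Rescore every document against the expanded query model.
    scores = [sum(qt * math.log(p) for p, qt in zip(model, q_tilde) if qt > 0)
              for model in doc_models]
    return q_tilde, scores


models = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1]]
q_tilde, new_scores = relevance_model_rescore(models, [1, 0, 0], [0, 1])
```

Since the query model, the document models, and the weights each sum to one, the expanded query remains a language model, which is what permits cross entropy scoring in the final step.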




2.2.3  Feature-Based  Retrieval 

Both the vector space model and the unigram language model approaches to information
retrieval represent documents as unordered bags of words. In feature-based retrieval,
documents are still represented as vectors. However, the components of these vectors do not
represent words. Instead, components represent the features we expect relevant documents
to have. We expect relevant documents to contain many occurrences of query terms, which
may be represented as a feature. Given the query sequence Q and a document D, this
feature may be defined by

$$\psi_T(Q, D) = \sum_{i} q_i d_i \tag{2.15}$$

where q is a vector of query terms, d is our m × 1 document term vector, and $\psi$ is our
document feature vector. Alternatively, we could use one of the term-based scores computed
by bag of words approaches. The attractive aspect of feature-based retrieval is that
we can represent more complex features in $\psi$ as well. For example, if we are interested in
the proximity between query terms, we might define the following feature,

$$\psi_O(Q, D) = \sum_{ij} q_i q_j d_{ij} \tag{2.16}$$

where d is an $m^2$ × 1 document proximity vector indicating the frequency of co-occurrence
of i and j within some window of terms. We can also define query-independent features
such as the PageRank or document quality [Brin and Page, 1998; Zhou and Croft, 2005].
In this thesis, Metzler’s Markov random field (MRF) model of retrieval represents the
family of feature-based retrieval methods [Metzler and Croft, 2005]. The MRF model of
retrieval computes the joint probability of D and Q as


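$$P_{G,\Lambda}(Q, D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c; \Lambda) \tag{2.17}$$

where $\Lambda$ is a feature vector and $Z_\Lambda = \sum_{D,Q} \prod_{c \in C(G)} \psi(c; \Lambda)$. If we take the logarithm of this equation,
we can derive the following ranking function,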


$$\log P_{G,\Lambda}(Q, D) \stackrel{\text{rank}}{=} \underbrace{\lambda_{T_D} \sum_{c \in T_D} f_{T_D}(c)}_{\text{terms}} + \underbrace{\lambda_{O_D} \sum_{c \in O_D} f_{O_D}(c)}_{\text{ordered pairs}} + \underbrace{\lambda_{U_D} \sum_{c \in U_D} f_{U_D}(c)}_{\text{unordered pairs}}$$

where $T_D$ are terms in Q, $O_D$ are ordered pairs of terms in Q, and $U_D$ are unordered pairs
of terms in Q. The operators $f_{*}$ are functions of the occurrence of those terms (or pairs) in
D.¹

¹The intimidating math often used to describe the MRF belies the simple reduction to a linear
combination of document scores. Similar methods have previously been used in metasearch [Montague and Aslam,
2001]. In fact, the parallels between feature-based methods and metasearch have allowed feature-based
methods to be applied directly to the metasearch problem [Carterette and Petkova, 2006; Yue et al., 2007].

2.2.4 Summary

There are a few observations about these retrieval functions worth noticing. First, in
most cases, document scores are calculated independently. There is often no explicit
representation of the score dependencies between documents. Second, we rank documents by
decreasing scores. We are confident that the highly ranked documents are likely to be relevant;
we are less confident that the lower ranked documents are relevant. Treating a score as a
confidence is different from treating it as a label. If we treat the score as a confidence of
relevance, there is no way to represent confidence that a document is not relevant. We will
return to this thought near the end of the thesis. Finally, these functions in practice have
very skewed distributions of scores; most of the n documents in the corpus and even most
of the top n retrieved documents have very low scores when compared to the top-ranked
documents. We present the score distributions for a few queries in Figure 2.1. This behavior
is consistent across most retrieval algorithms [Manmatha et al., 2001].

These retrieval models represent a very small but representative view of information
retrieval methods. One good catalog of alternative retrieval methods can be found in the
proceedings of the Text Retrieval Conference (TREC) [Voorhees and Harman, 2001]. In order to
demonstrate the generalizability of our results, we have attempted, when possible, to present
experiments which use the algorithms presented in Sections 2.2.1-2.2.3 as well as the larger
population of algorithms produced at the TREC conferences.

2.3 Inter-document Relationships

In many collections, documents exist independently of one another; that is, there is no
explicit relationship between any pair of documents. A news corpus, a collection of newswire
articles, is known to have this property in many situations. However, we know that, while
not explicit, relationships between documents indeed exist. For example, two news articles
about “hostage-taking” share a topic and therefore have a topical relationship. In this
section, we review prior work which studied inter-document relationships. We will describe
the definition and detection of inter-document relationships.

Depending on one’s perspective, two documents may be related in different ways. For
example, documents may be related by a citation, a hyperlink, coauthors, or shared topics.
When studying some property of the collection, we often must select from the set of all
possible relationships. The task of interest should guide the selection of an appropriate
inter-document relationship (or set of relationships). For example, because this thesis is concerned
with retrieving documents on a certain topic, we focus on inter-document topical relationships.
Although other relationships might, and often do, correlate with shared topics, the
fundamental task is driven by modeling the topics discussed in documents.

Inter-document relationships traditionally determined by explicit labeling can also be
inferred from similarity of the language shared between two documents. Several similarity
measures have been proposed and are the foundation of many classic clustering algorithms
[Lance and Williams, 1967]. A similarity measure’s effectiveness can be determined indirectly
by its influence on a task such as document classification or query-based retrieval. In
the topic detection and tracking (TDT) literature, inter-document similarity is evaluated
directly by comparing system predictions with human judgments; because the TDT program
was conducted in the context of news documents, this evaluation is referred to as “story link
detection” [Allan, 2002].

In this thesis, we will focus on two high-performing approaches to link detection: the
classic vector space model and language model [Chen et al., 2004]. These two approaches
are related to the score functions described in Section 2.2. Both of these approaches represent
documents using bags of words, ignoring proximity or phrase information. Our focus
on unigram techniques is motivated by the lack of significant improvement when dependencies
are considered [Bekkerman and Allan, 2004; Nallapati and Allan, 2002].

Pairwise relationships between documents will be represented by the n × n symmetric
matrix, A.² Each similarity measure will define all of the entries for A.

2.3.1  Cosine  Similarity 

Recall that in the vector space model, we assume that each document vector, $d_i$, is
weighted by tf.idf and $L_2$-normalized. The cosine between document vectors determines
affinity,

$$\cos(d_i, d_j) = \langle d_i, d_j \rangle = d_i^{T} d_j \tag{2.18}$$

The affinity matrix is defined by,

$$A_{\cos} = C C^{T} \tag{2.19}$$
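A minimal sketch of Equations 2.18 and 2.19 in Python follows. It assumes the tf.idf weighting has already been applied to `doc_vectors`; the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_affinity(doc_vectors):
    """Affinity matrix C C^T for L2-normalized rows, so each entry is a cosine."""
    rows = [l2_normalize(v) for v in doc_vectors]
    return [[sum(a * b for a, b in zip(r_i, r_j)) for r_j in rows]
            for r_i in rows]


affinity = cosine_affinity([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```

After normalization the inner product and the cosine coincide, which is why the full affinity matrix reduces to a single matrix product.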


2.3.2  Language  Model  Similarity 

When represented as language models, documents can be compared using a multinomial
similarity measure.

The Kullback-Leibler divergence between two distributions is a well-known, theoretically-
motivated measure of dissimilarity,

$$D_{KL}(d_i \| d_j) = -H(d_i) - \langle d_i, \log(d_j) \rangle \tag{2.20}$$

where $H(d)$ is the information entropy defined by $H(d) = -\langle d, \log(d) \rangle$; the second term is
the cross entropy between i and j. We should make a few observations about the Kullback-
Leibler divergence. First, the measure is asymmetric (i.e., $D_{KL}(d_i \| d_j) \neq D_{KL}(d_j \| d_i)$).
Unfortunately, the semantics of the asymmetry are unclear. This makes adoption of the
Kullback-Leibler divergence problematic. Second, although the measure is zero when two
multinomials are equal, there is no theoretical maximum for arbitrary multinomials.

²We assume symmetric relationships representing the sharing of a topic. We exclude asymmetric topical
relationships because they introduce assumptions such as containment and entailment which have not been
thoroughly studied in the information retrieval literature.

In order to address some of the problems with the Kullback-Leibler divergence, Lin
proposed an alternative measure inspired by Shannon entropy [Lin, 1991]. The Jensen-
Shannon divergence is defined as,

$$JS(d_i, d_j) = H(\pi_i d_i + \pi_j d_j) - \left( \pi_i H(d_i) + \pi_j H(d_j) \right) \tag{2.21}$$

where $\pi_i + \pi_j = 1$. In order for this measure to be symmetric, we set $\pi_i = \pi_j$. This results
in the following derivation,

$$H\left(\tfrac{1}{2}(d_i + d_j)\right) - \tfrac{1}{2}\left(H(d_i) + H(d_j)\right) \propto D_{KL}\left(d_i \,\middle\|\, \tfrac{1}{2}(d_i + d_j)\right) + D_{KL}\left(d_j \,\middle\|\, \tfrac{1}{2}(d_i + d_j)\right)$$
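The asymmetry of the Kullback-Leibler divergence and the symmetric Jensen-Shannon form can be checked numerically. The sketch below uses the standard definitions with natural logarithms; the function names and the two example distributions are illustrative assumptions.

```python
import math


def entropy(p):
    """Shannon entropy of a multinomial (natural logarithm)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """D_KL(p || q); assumes q is non-zero wherever p is non-zero."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence with equal weights pi_i = pi_j = 1/2."""
    mid = [(x + y) / 2.0 for x, y in zip(p, q)]
    return entropy(mid) - 0.5 * (entropy(p) + entropy(q))


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
mid = [(x + y) / 2.0 for x, y in zip(p, q)]
half_kl_sum = 0.5 * (kl_divergence(p, mid) + kl_divergence(q, mid))
```

With equal weights, the Jensen-Shannon divergence equals half the sum of the two Kullback-Leibler divergences to the midpoint distribution, which is the proportionality stated in the derivation.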

The Bhattacharyya distance measures the angle between multinomials and has been
used for link detection in the past [Chen et al., 2004]. The Bhattacharyya distance is defined
as,

$$B(d_i, d_j) = \langle \sqrt{d_i}, \sqrt{d_j} \rangle \tag{2.22}$$

A related measure is the Hellinger distance, defined as,

$$\begin{aligned}
H(d_i, d_j) &= \| \sqrt{d_i} - \sqrt{d_j} \|_2^2 \\
&= 2 \left( 1 - B(d_i, d_j) \right)
\end{aligned} \tag{2.23}$$

These measures, although comparable to one another, have not been shown to improve
performance for tasks such as link detection or clustering [Chen et al., 2004].

Lafferty and Lebanon propose a kernel for multinomials based on diffusion over the
multinomial manifold [Lafferty and Lebanon, 2005]. This affinity measure between two
distributions is motivated by the Fisher information metric and defined as,

$$\mathcal{K}_t(d_i, d_j) = \exp\left( -t^{-1} \arccos^2 \left( B(d_i, d_j) \right) \right) \tag{2.24}$$

where t is a parameter controlling the decay of the affinity. The diffusion kernel has been
shown to be a good affinity metric for tasks such as text classification [Lafferty and Lebanon,
2005]. In fact, when two documents are very similar, the diffusion kernel is nearly-equivalent
to the square root of the Kullback-Leibler divergence. For text, the Bhattacharyya distance
and the multinomial diffusion kernel are attractive for theoretical reasons. Lafferty and
Lebanon note [Lafferty and Lebanon, 2005, p. 139],

The Fisher information metric places greater emphasis on points near the boundary,
which is expected to be important for text problems, which typically have
sparse statistics.

For this reason, we adopt these measures in our experiments and define the following two
similarity matrices,

$$A_B = (\sqrt{C}) (\sqrt{C})^{T} \tag{2.25}$$

$$A_{\mathcal{K}_t} = \exp\left( -t^{-1} \arccos^2 (A_B) \right) \tag{2.26}$$
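For a single pair of language models, the measures in Equations 2.22 through 2.24 can be sketched as follows. The function names and the value of `t` are assumptions for illustration; the clamp only guards against floating-point values marginally outside the domain of the arc cosine.

```python
import math


def bhattacharyya(p, q):
    """Inner product of the element-wise square roots of two multinomials."""
    return sum(math.sqrt(x * y) for x, y in zip(p, q))


def hellinger(p, q):
    """Squared Euclidean distance between the square-root vectors."""
    return sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))


def diffusion_affinity(p, q, t=0.5):
    """exp(-arccos^2(B(p, q)) / t); t = 0.5 is an arbitrary illustrative value."""
    b = min(1.0, max(-1.0, bhattacharyya(p, q)))  # guard against rounding
    return math.exp(-(math.acos(b) ** 2) / t)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
b_pq = bhattacharyya(p, q)
```

Applying `bhattacharyya` to every pair of rows of C yields the matrix of Equation 2.25, and applying `diffusion_affinity` entry-wise yields Equation 2.26.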

We plot the relationship between the Bhattacharyya distance and the diffusion kernel in
Figure 2.2.

Figure 2.2. Relationship between Bhattacharyya and diffusion kernels.


2.3.3  Visualizing  Inter-document  Affinity 

In this thesis, we will, at times, visualize the matrix A in order to support explanation. In
this short section, we will describe the process for generating these two-dimensional
visualizations. We caution that projecting from m dimensions to two obscures many potentially-interesting
observations. Therefore, throughout this thesis, we will be using visual illustrations
to provide intuition for algorithms and measurements, not evidence of effectiveness.

Assuming all documents share at least a small subset of vocabulary, the affinity matrix, A,
contains $n^2$ non-zero entries. We make the matrix sparser by including only the k-largest
similarities for each document. More concretely, assume that $S_i$ is the size k set of indexes
of the maximum values in the ith row of A. The sparser matrix, W, is defined as,

$$W_{ij} = \begin{cases} A_{ij} & \text{if } i \in S_j \text{ or } j \in S_i \\ 0 & \text{otherwise} \end{cases} \tag{2.27}$$

In  addition  to  sparsifying  A,  we  also  make  the  affinity  matrix  tractable  by  using  query-based 
samples  of  size  n  from  the  collection.  Because  our  analysis  is  query-based,  this  means  we 
have  a  different  matrix  for  each  query. 
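The sparsification of Equation 2.27 can be sketched directly. The function name is hypothetical, and the sketch follows the definition literally, so a document's own index counts toward its k largest entries whenever the diagonal of A holds the row maximum.

```python
def knn_sparsify(affinity, k):
    """Keep A[i][j] only when i is among j's k largest entries or vice versa."""
    n = len(affinity)
    neighbors = []
    for i in range(n):
        ranked = sorted(range(n), key=lambda j: affinity[i][j], reverse=True)
        neighbors.append(set(ranked[:k]))  # S_i
    return [[affinity[i][j] if (i in neighbors[j] or j in neighbors[i]) else 0.0
             for j in range(n)]
            for i in range(n)]


A = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
W = knn_sparsify(A, k=2)
```

Because an entry is kept when either document selects the other, W remains symmetric whenever A is symmetric.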

Given a pairwise affinity matrix, we can use a number of techniques for embedding the
data in two dimensions. Later in this thesis, we will use the combinatorial Laplacian defined
for W in order to analyze retrieval functions. The Laplacian provides a robust, diffusion-
based embedding into two dimensions [Coifman and Lafon, 2006]. Alternatively, when
considered as a graph, the affinity matrix can be projected using spring-embedding or energy-
based graph drawing techniques [Leuski, 2001; Adai et al., 2004]. We compare projections
based on the Laplacian and a spring-embedding in Figure 2.3. Although we notice that some
of the coarse structure in the projections is similar, the embedding based on the Laplacian
results in a less-intuitive rendering. Visualization based on the eigenvectors of the Laplacian
arises from low frequency harmonics on the graph. This results in a nice visualization of
coarse graph structure. Spring embedding captures lower level structure but introduces
some high level error. Because retrieval focuses on fine grained clusters of documents, we
adopt the spring embedding layout when illustrating data or algorithmic effects. Although
we abandon the Laplacian for visualization, we will reintroduce it in Chapter 5 for analyzing
score functions.

In this thesis, we use the Large Graph Layout spring-embedding algorithm to visualize
document graphs [Adai et al., 2004]. We present an example graph for different values of
k and n in Figure 2.4.

2.3.4  Summary 

Inter-document  similarity,  as  presented  here,  reduces  to  a  function  of  the  inner  product 
of  two  document  vectors.  As  we  mentioned  earlier,  approaches  incorporating  dependencies 
between  dimensions  have  usually  demonstrated  only  slight  improvements  in  link  detection. 
Our  adoption  of  linear  similarity  measures  will  allow  us  to  analyze  a  number  of  retrieval 
methods  in  a  general  way  in  Chapter  6. 
[Panels: (a) diffusion maps; (b) spring-embedding]

Figure 2.3. Comparison of two-dimensional embedding using diffusion maps and Large
Graph Layout.

Figure 2.4. Document graphs for the TREC query “The Effectiveness of Medical Products
and Related Programs Utilized in the Cessation of Smoking”. Document graphs are built
using the top n retrieved documents. Edges are added for the k nearest neighbors of each
document.

PART I


AUTOCORRELATION  OF  RETRIEVAL  SCORES 


CHAPTER  3 


THE  CLUSTER  HYPOTHESIS  IN  INFORMATION  RETRIEVAL 


The cluster hypothesis, as posed by van Rijsbergen, states: “closely associated documents
tend to be relevant to the same [queries]” [van Rijsbergen, 1979]. In general, this
hypothesis has been tested using two approaches. The first approach, originally proposed by
Jardine and van Rijsbergen, tests the degree to which relevant documents exist as a single,
cohesive cluster distinct from the non-relevant documents. The second approach, advanced
by Voorhees as an alternative to the Jardine-van Rijsbergen test, measures the degree to
which relevant documents are related to other relevant documents. The confirmation of
the cluster hypothesis for an individual query justifies the incorporation of inter-document
similarity into our final document ranking. Either test, Jardine-van Rijsbergen or Voorhees,
motivates its own set of approaches. We will compare the assumptions and behaviors of the
Jardine-van Rijsbergen and Voorhees tests, arguing that the Voorhees test is more robust and
appropriate for information retrieval. This chapter provides a foundation for development
of local score consistency in the next chapter.

3.1  The  Jardine-van  Rijsbergen  Test 

Jardine and van Rijsbergen test the cluster hypothesis by comparing the distribution of
similarities between relevant documents and the distribution of similarities between relevant
and non-relevant documents; we refer to these distributions as RR and NR, respectively. In
Figure 3.1, we indicate the submatrices of W representing the similarities between relevant
and non-relevant documents. Histograms of similarities found in these submatrices let us
estimate distributions of similarities within and between the classes. Jardine and van
Rijsbergen test the following hypothesis,

Hypothesis 3.1. The RR and NR distributions are well-separated.

In Figure 3.2, we demonstrate with artificial data situations where the cluster hypothesis
holds and where it does not. Jardine and van Rijsbergen use visual inspection as evidence for
and against the cluster hypothesis. In subsequent work, van Rijsbergen and Sparck Jones
suggested that the degree to which collections satisfy the clustering hypothesis (by visual
inspection) correlates strongly with retrieval performance [van Rijsbergen and Sparck Jones,
1973]. Quantitative comparisons between the two distributions might include comparing
the means. Griffiths et al. test for the cluster hypothesis by measuring the overlap between
the two distributions in Figure 3.2 [Griffiths et al., 1986].
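One way to make the comparison of the RR and NR distributions concrete is sketched here. The histogram-intersection statistic is only one simple choice of overlap measure and is not necessarily the one used by Griffiths et al.; the function names, the bin count, and the assumption that similarities lie in [0, 1] are ours.

```python
def rr_nr_similarities(affinity, relevant):
    """Collect similarities within the relevant set (RR) and across classes (NR)."""
    rel = sorted(relevant)
    rel_set = set(rel)
    non = [i for i in range(len(affinity)) if i not in rel_set]
    rr = [affinity[i][j] for a, i in enumerate(rel) for j in rel[a + 1:]]
    nr = [affinity[i][j] for i in rel for j in non]
    return rr, nr


def histogram_overlap(xs, ys, bins=10):
    """Shared mass of two normalized histograms over [0, 1]; 0 means separated."""
    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[max(0, min(int(v * bins), bins - 1))] += 1
        return [c / len(values) for c in counts]
    return sum(min(a, b) for a, b in zip(hist(xs), hist(ys)))


A = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.1, 0.15],
     [0.1, 0.1, 1.0, 0.5],
     [0.2, 0.15, 0.5, 1.0]]
rr, nr = rr_nr_similarities(A, relevant=[0, 1])
overlap = histogram_overlap(rr, nr)
```

An overlap near zero corresponds to the well-separated case of Hypothesis 3.1, while an overlap near one indicates that similarity does not distinguish relevant pairs from mixed pairs.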

In the most general interpretation, Hypothesis 3.1 makes a conjecture about the relationship
between all documents in the collection. That is, if we select two random documents
from the collection and if they are topically related, then we should expect them to have
the same relevance. In practice, authors often confine the analysis to the top $\tilde{n}$ documents
retrieved from the query [Hearst and Pedersen, 1996].

[Block matrix: rows and columns are split into the |R| relevant documents and the n - |R|
non-relevant documents; the blocks are RR and NR in the first row, NR and NN in the second.]

Figure 3.1. Matrix elements used for testing the cluster hypothesis.

3.1.1 Approaches Motivated by the Jardine-van Rijsbergen Test

So  far,  we  have  described  how  one  might  test  Hypothesis  3.1.  If  there  is  evidence  that 
this  hypothesis  is  true  for  a  query  or  collection,  how  do  we  change  our  retrieval  methods  to 
exploit  clustering?  In  the  remainder  of  this  section,  we  will  review  several  techniques. 

3.1.1.1 Clustering Documents

Clustering refers to the assignment of each document to one or more of k groups of
documents. Documents within a cluster all share some topic. Clusters can be defined
manually or automatically. Automatic clustering may refer to any number of algorithms
and document representations; algorithms include agglomerative clustering [Croft, 1980;
Griffiths et al., 1986; van Rijsbergen and Sparck Jones, 1973], partition-based clustering
[Hearst and Pedersen, 1996], latent semantic analysis [Deerwester et al., 1990], and several
language-model based techniques [Xu and Croft, 1999; Hofmann, 1999; Liu and Croft,
2004; Kurland, 2006; Wei and Croft, 2006].

In hard clustering, the documents in the collection are partitioned into k topic-based
clusters. We can represent a clustering as the k × n matrix, V, where columns are binary
vectors indicating the cluster membership of each document. The classic example of a hard
clustering is the hard k-means algorithm.

Hard clustering techniques limit a document to a single, discrete cluster label. However,
documents rarely discuss one topic. Because of this, several clustering methods exist based
on assigning documents to multiple clusters. Soft clustering refers to assigning each document
to a set of clusters. In the vector space model, a clustering can be produced by projecting
documents into a lower-dimensional space derived by, for example, soft k-means or singular
value decomposition. The matrix V now contains real valued components such that $V_{ij}$
refers to the degree to which the document j discusses a topic i. In language modeling,
clusters are frequently referred to as topics and documents are represented as mixtures of
topic language models. So, probabilistic semantics are attached to V so that $V_{ij}$ refers to
the probability that a document discusses a topic, $P(z = i \mid d = j)$, where z is a random
variable over k topics and d is a random variable over documents.

[Two histogram panels with similarity (0.0 to 1.0) on the horizontal axis and P(sim.) on the
vertical axis: (a) Good Separation; (b) Bad Separation.]

Figure 3.2. Artificial scenarios where the cluster hypothesis holds (a) and does not hold (b).

Agglomerative Clustering

1. Compute n × n affinity matrix, A

2. Find the entry with the highest similarity, $A_{ij}$

3. Compute the similarity between all clusters and the cluster formed by merging i and j.
Place similarities in $A_{i\cdot}$, $A_{\cdot i}$

4. Remove $A_{j\cdot}$, $A_{\cdot j}$

5. If A is of more than 1 dimension, go to 2

Figure 3.3. General Agglomerative Clustering Algorithm.

One drawback of both hard and soft clustering is that partitions are considered independent
of each other. However, inter-topic relationships are often assumed to be hierarchical;
i.e., topics are composed of subtopics, subtopics are composed of subsubtopics, et cetera. This
hierarchical structure can be modeled by using agglomerative clustering. Agglomerative
clustering iteratively builds a hierarchy of clusters by merging similar documents and clusters. We
present the general agglomerative clustering algorithm in Figure 3.3. When constructing a
hierarchy, we are free to select a suitable agglomeration method (Step 3). Three popular
agglomeration methods include single link, average link, and Ward’s method. Single link
agglomeration refers to computing the similarity between two clusters i and j as the highest
similarity between pairs of documents spanning i and j. Average link agglomeration refers
to computing the similarity between two clusters i and j as the average similarity between
pairs of documents spanning i and j. Ward’s method refers to computing the similarity
between two clusters i and j as the variance-weighted average similarity between pairs of
documents spanning i and j. Agglomerative clustering results in a hierarchy referred to as
a dendrogram. We present example dendrograms in Figure 3.4. Dendrograms are full binary
trees. Therefore, there are k = n - 1 non-singleton clusters and the cluster assignment
matrix, V, has dimension (n - 1) × n. In general, single link clustering tends to result in long
“straggly” clusters, while average link and Ward’s method tend to produce more compact
clusters. Ward’s method, because of the variance weighting, creates elliptical clusters with a
relatively flatter hierarchy.
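The general procedure can be sketched for single link and average link agglomeration. This is an assumption-laden illustration rather than the implementation used in our experiments: for clarity it recomputes cluster similarities from the document-level affinities at every step instead of updating A in place, and it omits Ward’s method.

```python
def agglomerative(affinity, linkage="average"):
    """Return the n - 1 merges as (members_a, members_b, similarity) tuples."""
    clusters = {i: [i] for i in range(len(affinity))}

    def cluster_similarity(a, b):
        sims = [affinity[i][j] for i in clusters[a] for j in clusters[b]]
        if linkage == "single":
            return max(sims)          # highest similarity across the two clusters
        return sum(sims) / len(sims)  # average link

    merges = []
    next_id = len(affinity)
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the most similar pair of current clusters.
        sim, a, b = max(((cluster_similarity(a, b), a, b)
                         for x, a in enumerate(ids) for b in ids[x + 1:]),
                        key=lambda item: item[0])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), sim))
        # Replace the two clusters by their union.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


A = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
single = agglomerative(A, "single")
average = agglomerative(A, "average")
```

Each merge corresponds to one non-leaf node of the dendrogram, so n documents always produce n - 1 merges regardless of the agglomeration method.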
Figure 3.4. Dendrograms for three hierarchical clustering methods. The top 1000 documents
retrieved for the query “hostage-taking” were hierarchically clustered according to
single link, average link, and Ward’s method.

3.1.1.2 Cluster-based Retrieval

If  we  assume  that  a  single,  distinct  cluster  of  relevant  documents  exists  for  a  query,  then 
we  may  want  to  develop  algorithms  that  find  and  retrieve  documents  from  the  relevant 
cluster  and  no  others.  This  is  one  type  of  cluster-based  retrieval. 

An ancillary data structure which is often produced from a clustering algorithm is an
m × k matrix, U, which represents the clusters in the ambient space. In the vector space
model, these are often produced by averaging the ambient representations of the member
documents. These averages are referred to as the cluster centroids. In the language modeling
framework, the columns of U are multinomials referred to as topic language models (i.e.,
$U_{ij} = P(w = i \mid z = j)$). Other approaches include representing clusters by a single
document typical of the cluster. Typicality here could refer to the document in the densest
region of the cluster. These documents are referred to as the cluster medoids. The matrix U
allows us to score clusters. In the vector space model, we can rank each cluster i according
to $\cos(U_{\cdot,i}, q)$. Using language models, we rank each cluster i according to $\langle \log(U_{\cdot,i}), q \rangle$.

When ranking clusters, Jardine and van Rijsbergen found that incorporating the size of
the cluster into the ranking function improved effectiveness [1971]. Specifically, Jardine and van
Rijsbergen use binary term vectors where the component value is 1 if the within-cluster
document frequency is greater than $\log C_j$ where $C_j$ is the size of the cluster. Croft used scalar
indexing by computing smoothed, within-cluster document relative frequencies [1980].

We  can  use  cluster  scoring  in  order  to  guide  the  retrieval  process.  The  simplest  retrieval 
method  considers  only  the  top-ranked  cluster.  Let  c  represent  the  top-ranked  cluster.  We 
can  then  retrieve  documents  according  to, 

1. the set defined by $\{j \mid V_{c,j} > 0\}$

2. the ranking of documents in $\{j \mid V_{c,j} > 0\}$ according to $d^{T} q$

3. the ranking defined by the row $V_{c,\cdot}$
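The second option, ranking the members of the top-ranked cluster by their own scores, can be sketched for the vector space model. The helper names are hypothetical, clusters are assumed to be non-empty, and centroids are computed as simple averages of member vectors.

```python
import math


def cosine(u, v):
    """Cosine between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def centroids(doc_vectors, membership):
    """Average the vectors of each cluster's members; membership is k x n."""
    result = []
    for row in membership:
        members = [doc_vectors[j] for j, flag in enumerate(row) if flag > 0]
        result.append([sum(col) / len(members) for col in zip(*members)])
    return result


def retrieve_from_top_cluster(doc_vectors, membership, query):
    """Pick the cluster whose centroid best matches q, then rank its members."""
    cents = centroids(doc_vectors, membership)
    top = max(range(len(cents)), key=lambda c: cosine(cents[c], query))
    members = [j for j, flag in enumerate(membership[top]) if flag > 0]
    ranked = sorted(members,
                    key=lambda j: sum(a * b for a, b in zip(doc_vectors[j], query)),
                    reverse=True)
    return top, ranked


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
V = [[1, 1, 0, 0], [0, 0, 1, 1]]
top, ranked = retrieve_from_top_cluster(docs, V, [1.0, 0.0])
```

Documents outside the selected cluster are never returned, which illustrates why a poorly chosen cluster can hurt effectiveness.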

Early work demonstrated that using this technique with hard clustering hurt effectiveness;
retrieved documents often included many non-relevant documents [Salton, 1971]. In the
context of single link hierarchical clustering, Jardine and van Rijsbergen showed that ranking
all k clusters and retrieving a set of documents improved the effectiveness of search over
non-cluster techniques for high precision evaluation [1971].

If we treat each non-leaf node of the dendrogram as a retrievable cluster, then we can
exploit the hierarchy in order to search for the top-ranked cluster. Jardine and van Rijsbergen
proposed searching for the top-ranked cluster by scoring clusters starting at the root of
the dendrogram and stopping when scores stopped increasing [1971]. We show this graphically
in Figure 3.5. This demonstrated effectiveness similar to performing a global search
for the top-ranked cluster.
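A top-down search of this kind can be sketched as a greedy descent. The dictionary-based tree layout and the toy scoring function are assumptions made for illustration; any cluster scoring function could be substituted.

```python
def top_down_search(root, score):
    """Descend from the root while some non-leaf child scores strictly higher.

    Each node is a dict with a 'members' list and, for non-leaf nodes, a
    'children' list (an assumed representation of the dendrogram).
    """
    best, best_score = root, score(root)
    while True:
        candidates = [c for c in best.get("children", []) if c.get("children")]
        if not candidates:
            return best
        child = max(candidates, key=score)
        child_score = score(child)
        if child_score <= best_score:
            return best  # scores stopped increasing
        best, best_score = child, child_score


def leaf(i):
    return {"members": [i]}


left = {"members": [0, 1], "children": [leaf(0), leaf(1)]}
right = {"members": [2, 3], "children": [leaf(2), leaf(3)]}
root = {"members": [0, 1, 2, 3], "children": [left, right]}

target = {0, 1}  # toy stand-in for a query: the cluster it should match
found = top_down_search(
    root, lambda node: len(target & set(node["members"])) / len(node["members"]))
```

The descent visits only one path from the root, so it scores far fewer clusters than a global search over all non-leaf nodes.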

These results in top-down search used the size-penalized version of cluster-ranking,
biasing retrieved clusters toward those which occur near the bottom of the hierarchy. Croft
proposed a bottom-up cluster search method which ranks the set of clusters with any leaf
children [1980]. This method outperformed top-down searches and outperformed
non-cluster techniques for high precision evaluation.

[Figure 3.5] ... bottom non-leaf nodes of the dendrogram.


The success of bottom-up searches may be alarming because it suggests two possible
issues with hierarchical clustering. First, matching queries against higher-level clusters may
not adequately incorporate information important to the performance measure. For example,
consider the representation of a cluster containing a pair of closely-related relevant
documents. In the vector space model, this cluster might be represented at retrieval time (as
opposed to clustering time) by the linear combination of component document vectors.
The introduction of random, non-relevant documents will not, on average, affect the
centroid. But these are precisely the types of clusters we expect to find at the top of the hierarchy.
Since both centroids are the same, and since higher level clusters may contain non-relevant
documents, we may expect queries matching lower level clusters to perform better. A second
observation suggested by the success of bottom-up searches is that the agglomeration
process is unsatisfactory. Regardless of how well we represent clusters of documents, if the
underlying process does not group together similar documents, then effective cluster-based
retrieval is very difficult.

These concerns were partially addressed by the work of Griffiths et al., which demonstrated
that different agglomeration methods could be used to address the problems with
single link clusters [1986]. In the context of bottom-up searches, agglomeration by average
link and Ward’s method were shown to consistently outperform single link clusters. We
can refer to Figure 3.4 to better understand the difference between single link clusters and
these other two methods. Clusters in the single link hierarchy are taller, illustrating the
greedier fashion in which clusters are formed. Recall also that this greedy clustering results
in straggly clusters that potentially span relatively large areas of space. Average link and
Ward’s method clusters are much shorter, indicating that they tend to be localized in compact
regions of space. The success of these localized methods may indicate that relevant
documents tend to be isolated to small areas of space, as opposed to large or stringy areas.

3.1.1.3 Document Expansion

Document vectors, despite being estimated from long text samples, still suffer from sparsity.
When the sparsity is systematic within the topic of the document we run the risk of
missing the document during retrieval. For example, if the author never uses the word “canine”
in a document about “dogs”, then the system will always miss this document when
a user queries “canine”. Salton recognized that a document’s representation should include
terms occurring in topically-related documents [Salton, 1968]:

The value of a document is assumed to be a linear function of the values of the
terms it contains as well as of the values of the associated documents.

We refer to this process as document expansion. When the topical relationships are based on
clusters, we refer to this process as cluster-based document expansion. Originally proposed in the
context of language modeling [Liu and Croft, 2004], an expanded document representation
can be formulated as,

$$ P(w \mid \tilde{\theta}_d) = \lambda P(w \mid \theta_d) + (1 - \lambda) \sum_{i=1}^{k} P(w \mid \theta_i)\, c(d, i) \qquad (3.1) $$

where $c(d, i)$ is an indicator function whose value is 1 if $d$ belongs to cluster $i$ and 0 otherwise, and $\theta_i$ is the
topic model of cluster $i$. When documents are composed of multiple topics, Wei and Croft [Wei and
Croft, 2006] proposed the following expanded document model,

$$ P(w \mid \tilde{\theta}_d) = \lambda P(w \mid \theta_d) + (1 - \lambda) \sum_{i=1}^{k} P(w \mid \theta_{c_i})\, P(z = i \mid d) \qquad (3.2) $$

where $P(z = i \mid d)$ is derived from U, computed by using Blei et al.’s Latent Dirichlet
Allocation clustering technique [Blei et al., 2003]. Hofmann [Hofmann, 1999] studied the
special case of this expansion method with $\lambda = 0$ when using the Probabilistic Latent Semantic
Analysis clustering approach.
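
The interpolation in Equations 3.1 and 3.2 can be illustrated with a small sketch. It is not the implementation used in the cited work: the function name `expand_document`, the dictionary representation of language models, and the default value of `lam` are assumptions made for illustration. Passing 0/1 memberships corresponds to the indicator $c(d, i)$, while passing a distribution over clusters corresponds to $P(z = i \mid d)$.

```python
def expand_document(doc_model, cluster_models, memberships, lam=0.5):
    """Interpolate a document language model with cluster language models.

    doc_model      -- dict mapping term -> P(w | theta_d)
    cluster_models -- list of dicts, one per cluster, term -> P(w | theta_i)
    memberships    -- list of weights, one per cluster: the indicator c(d, i)
                      or the topic probability P(z = i | d)
    lam            -- interpolation weight lambda (hypothetical default)
    """
    vocabulary = set(doc_model)
    for model in cluster_models:
        vocabulary.update(model)

    expanded = {}
    for w in vocabulary:
        # Weighted sum of the cluster models for this term.
        cluster_part = sum(
            weight * model.get(w, 0.0)
            for model, weight in zip(cluster_models, memberships)
        )
        expanded[w] = lam * doc_model.get(w, 0.0) + (1.0 - lam) * cluster_part
    return expanded


# Toy example: a document about dogs that never mentions "canine".
doc = {"dog": 0.7, "leash": 0.3}
clusters = [{"dog": 0.5, "canine": 0.5}, {"stock": 1.0}]
hard = expand_document(doc, clusters, memberships=[1, 0], lam=0.5)
```

In this toy example the expanded model assigns non-zero probability to “canine” because the document's cluster uses that term, which is the effect document expansion is meant to achieve.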

3.2  The  Voorhees  Test 

Voorhees, concerned with the disparity between the size of RR and NR in Figure 3.2,
suggested an alternative test [Voorhees, 1985]. Instead of looking at the distribution of
similarities, Voorhees measured the density of relevant documents near other relevant documents.
That is, for each relevant document, we will look at its k nearest neighbors and
compute what we will refer to as the local precision. For example, if k = 5, for each relevant
document, we look at its five closest documents; the local precision is the number of relevant
documents among them divided by five. Voorhees then tests the following hypothesis:

Hypothesis  3.2.  Relevant  documents  have  high  local  precision. 

Graphic  representations  of  local  precision  distributions  originally  presented  in  [Voorhees, 
1985]  are  displayed  in  Figure  3.6.  By  qualitative  inspection,  Voorhees  argued  that  the  MED 
collection  satisfies  Hypothesis  3.2  because  relevant  documents  tend  to  be  related  to  other 
relevant  documents.  Relevant  documents  in  the  CACM,  CISI,  and  INSPEC  collections  in 
Figure  3.6,  however,  tend  to  be  isolated  from  each  other,  implying  that  Hypothesis  3.2  is 
not  supported  for  these  three  collections. 
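
As a minimal sketch of the local precision computation described above, assuming a dense inter-document similarity matrix and binary relevance judgments are available (the function name and data layout are illustrative, not taken from Voorhees):

```python
def local_precisions(similarity, relevant, k=5):
    """Return the local precision of each relevant document.

    similarity -- square list of lists; similarity[i][j] is the similarity
                  between documents i and j
    relevant   -- list of 0/1 relevance judgments, one per document
    k          -- number of nearest neighbors to inspect
    """
    n = len(similarity)
    result = {}
    for i in range(n):
        if not relevant[i]:
            continue
        # Rank all other documents by decreasing similarity to document i.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: similarity[i][j], reverse=True)
        neighbors = others[:k]
        result[i] = sum(relevant[j] for j in neighbors) / k
    return result


# Toy collection: documents 0 and 1 are relevant and similar to each other.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
rel = [1, 1, 0, 0]
precisions = local_precisions(sim, rel, k=2)
```

A histogram of the returned values gives the kind of distribution discussed for Figure 3.6; mass near 1 supports Hypothesis 3.2, while mass near 0 indicates isolated relevant documents.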

3.2.1 Approaches Motivated by the Voorhees Test

The  Voorhees  test  has  received  much  less  attention  in  the  information  retrieval  literature 
than  the  Jardine-van  Rijsbergen  test.  In  this  section,  we  will  review  several  approaches  which 
assume  that  queries  satisfy  Hypothesis  3.2. 

3.2.1.1 Multiple-Cluster Retrieval

The Jardine-van Rijsbergen test motivated retrieving a single cluster from some set of
clusters. If searched top-down, then retrieved clusters may include many non-relevant documents.
If searched bottom-up, then retrieved clusters may have higher precision but lower
recall. The Voorhees test suggests that relevant documents potentially occur in isolated,
locally-dense clusters. Recall that we can capture the local density by searching bottom-up,
as suggested by Croft. We can retrieve disparate clusters by simply retrieving and merging
documents from multiple clusters instead of just one. Voorhees proposed ranking the bottom
level single link clusters in a dendrogram and retrieving the top-ranking document from
each cluster [1985]. This approach improved the effectiveness of the retrieved set of documents.
Griffiths et al. demonstrated similar improvements when retrieving multiple average
link and Ward’s method clusters [1986].

Figure 3.6. Local precision cluster hypothesis test for four collections presented in
[Voorhees, 1985]. For each relevant document, we compute the number of relevant documents
in its five nearest neighbors; we refer to this as the local precision. According to this
measure, the MED collection exhibits high clustering; relevant documents tend to be near
other relevant documents. On the other hand, relevant documents in other collections tend
to be surrounded by fewer relevant documents.

One conclusion we can draw from the increasing effectiveness as we move from top-down
to bottom-up, from single link to Ward’s method, and from single cluster to multiple
cluster retrieval is that relevance tends to be supported by many, small, tight clusters as
opposed to larger, stragglier clusters. Griffiths et al. took this to an extreme and considered
clusters consisting of only pairs of nearest-neighbors [1986]. This algorithm performed
multiple cluster retrieval using clusters consisting of pairs of documents. Not only did this
clustering outperform all other cluster-based retrieval methods but it also outperformed
non-cluster techniques for some collections. These results were confirmed in the context of
language modeling [Kurland and Lee, 2004].

3.2.1.2 Spreading Activation

Griffiths et al. took cluster-based retrieval to one extreme by retrieving small nearest-neighbor
clusters. The success of this representation and algorithm demonstrates a progression
toward increasingly local analysis of relevance. In this section, we will describe
spreading activation, a method which, although predating some of the cluster-based retrieval
approaches, can be seen as a philosophical descendant of cluster-based retrieval methods.
We believe that spreading activation assumes, though never explicitly tests, the hypothesis
underlying the Voorhees test.

Spreading activation refers to the technique of propagating relevance information between
topically related documents, where the relationships are represented by the matrix W defined earlier.1
Spreading activation usually does no explicit clustering and uses only pairwise relationships between
documents. The algorithm is usually initialized by attaching a relevance value to the nodes of
the graph either as explicit binary labels or as scores from some initial retrieval. Each node
in the graph then recomputes its score by inspecting the relevance values of its neighbors.
This process is iterated until convergence. Several propagation rules have been studied in
the spreading activation literature. The updated score for a document might be the maximum
score of related documents, the average, or some other aggregating function. The
relationship to the Voorhees test should be clear. Spreading activation makes the implicit
assumption that the relevance label of each node is related to the relevance label of topically
related documents.
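
The iteration described above can be sketched as follows. This is one possible propagation rule, not a specific published model: it assumes a row-normalized neighbor matrix W and, as an added assumption, interpolates each node's initial score with the average score of its neighbors using a hypothetical weight `alpha`, which guarantees convergence for `alpha` below 1.

```python
def spread_activation(W, initial, alpha=0.5, tol=1e-9, max_iter=1000):
    """Propagate scores over a document graph until they stop changing.

    W        -- row-normalized neighbor matrix (list of lists); W[i][j] > 0
                when document j is a neighbor of document i
    initial  -- initial relevance values (binary labels or retrieval scores)
    alpha    -- weight given to the neighbors' scores (hypothetical parameter)
    """
    scores = list(initial)
    for _ in range(max_iter):
        updated = []
        for i, row in enumerate(W):
            # Weighted average of the neighbors' current scores.
            neighbor_avg = sum(w * s for w, s in zip(row, scores))
            updated.append((1 - alpha) * initial[i] + alpha * neighbor_avg)
        change = max(abs(u - s) for u, s in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


# Three documents in a chain; only the first has initial evidence of relevance.
W = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
scores = spread_activation(W, [1.0, 0.0, 0.0])
```

In the toy chain, activation decays with graph distance from the initially relevant document. Replacing the weighted average with a maximum over neighbors would give one of the other aggregating functions mentioned above.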

The original spreading activation model proposed by Preece used manually-built graphs
with multiple link types [1981]. The assumption behind manually-built graphs is that the
document scores should be correlated by the manual information; certainly we could think
of manual labels which would not indicate a correlation. When suitable manual links are
absent, topical relationships can be approximated by the methods presented in Section 2.3.
Croft’s work in network-based collection representations studied combinations of manual
and automatic relationships. For example, the I3R system combined citation and automatic
links [Croft et al., 1988]. The results indicate that in many situations, tf.idf baselines could
be improved by propagating relevance information over both automatic and citation-based
links. The latter were shown to be superior to automatic links. Croft and Turtle extended
the description of the I3R system to consider hypertext as well as citations [1989]. Both types
of links were demonstrated to improve performance, although hyperlinks provided a larger
improvement. Similar experiments using variations of propagation methods, baseline algorithms,
and similarity functions have shown similar improvements. More recent results
in the context of web search replicate these results for newer collections [Qin et al., 2005;
Shakery and Zhai, 2006].

1 In general, the graph may consist of heterogeneous nodes including documents, terms, metadata, and
concepts. We will focus on homogeneous graphs where nodes correspond to documents and edges correspond
to some relationship between documents. For general spreading activation, readers should consult Crestani’s
survey of classic spreading activation models [Crestani, 1997].

3.2.1.3 Local Document Expansion

Salton originally described document expansion in the context of the vector space model,
where document vectors were interpolated with their nearest neighbors’ vectors. We refer to this
process as local document expansion because there is no explicit clustering performed. Also operating
within the vector space model, Singhal and Pereira applied local document expansion
in order to reduce noise in collections of speech-recognized documents [Singhal and Pereira,
1999]. In the language modeling framework, Ogilvie originally proposed nearest-neighbor
smoothing [Ogilvie, 2003] while Kurland and Lee rigorously evaluated it [Kurland and
Lee, 2004]. This approach has also been used in hypertext collections for propagating term
weights across hyperlinks [Qin et al., 2005].

3.3  Jardine-van  Rijsbergen  or  Voorhees? 

The critical difference between Hypothesis 3.1 and Hypothesis 3.2 lies in the assumption
each makes about the set of relevant documents. Specifically, Hypothesis 3.1 assumes
that there is a single, coherent relevant cluster while Hypothesis 3.2 only assumes that relevant
documents have high local precision. The implication of this latter assumption can
be demonstrated by inspecting the behavior of the measurements as the size of the relevant
document set grows for different numbers of relevant clusters. In Figure 3.7, we show artificial
2-dimensional data produced to represent a single relevant cluster in the midst of
non-relevant data. Notice that as the size of the relevant cluster grows, both Hypothesis 3.1
and Hypothesis 3.2 receive increasing support. However, if we have several distinct, well-separated
clusters, we observe very different behavior. In Figure 3.8, we present artificial
data exhibiting four relevant clusters. Because relevant documents do not exist in a single
cluster, the RR and NR distributions are difficult to distinguish. This effect is produced
because the RR distribution includes distances between documents in different relevant clusters.
As the number of relevant clusters (and the sample from each) grows, the RR distribution
will begin to include more pairs of documents with low similarity. Hypothesis 3.2, because it
focuses more on the local behavior, is well-supported, regardless of the number of relevant
clusters.2

2 The presence of several, distinct clusters is not exceptional. The TREC interactive track, for example,
studied queries consisting of several aspects [Over, 1996]. Leouski and Allan, studying interactive retrieval
and visualization, noted that the relevant document set is likely to include multiple clusters [Leouski and
Allan, 1998].

Gaussian; non-relevant points are sampled uniformly.


Figure 3.9. Two relevant clusters of non-uniform density. The first subfigure shows fifty
relevant points and 150 non-relevant points in two dimensions. The second subfigure shows
the distributions of similarities between relevant documents (RR) and between relevant and
non-relevant documents (NR). The third subfigure shows the distribution of local precision.


Another problem with testing Hypothesis 3.1 occurs when the similarity measure exhibits
non-uniform behavior between topically related documents. Consider the artificial
example in Figure 3.9. Non-relevant documents exhibit very tight clustering while the relevant
set of documents is more diffuse. The effect is that the similarities in the submatrix RR
tend to be much larger than the similarities in the submatrix NR. Again, the local precision
is more robust to this non-uniformity because the nearest-neighbor criterion is adaptive.

So far, we have used artificial examples when comparing the two methods of testing
the cluster hypothesis. To complement these examples, we performed the same measurements for two sets of
queries used in this thesis: ql/trec12 and ql/robust (see Appendix A for details). In these
experiments, we microaveraged similarities and local precisions over a set of 150 queries for
trec12 and 250 queries for robust. The set of queries represented by trec12 are considered by
the community to be relatively easier to satisfy than those in robust. We present our results
in Figure 3.10. Qualitatively, we do not see compelling evidence for Hypothesis 3.1 in either
collection. Although we would like there to be more evidence for the cluster hypothesis in the
easier collection, there also does not seem to be a difference in the overlap of the distributions
for the two collections. Hypothesis 3.2, on the other hand, is well-supported for the trec12
collection. The robust collection tends to include many more isolated relevant documents,
which is consistent with our impression that it is more difficult. Hence, the Voorhees test not
only correctly detects clustering in both corpora (more than half of the relevant documents
have a local precision of > 0.50), but also distinguishes between these two collections.

3.4  Summary 

Vector space representations of text are fundamentally mysterious because of their high
dimensionality. We cannot visually inspect patterns of points in the ambient space. We can
try to visualize this by projecting vectors into two or three dimensions but must acknowledge
that a huge amount of information is lost in the process. If we want to make statements about
the collection, we can only do so by measuring properties of the affinity matrix. Although
both the Jardine-van Rijsbergen and Voorhees tests accomplish this, we adopt the Voorhees
test because it introduces fewer assumptions about the behavior of relevant documents and
the uniformity of the similarity measure. In the next chapter, we will adapt the Voorhees
test to measure the local consistency of scores.

Figure 3.10. Microaveraged values of the Jardine-van Rijsbergen and Voorhees tests for
baseline retrievals using the trec12 and robust collections. The Jardine-van Rijsbergen test
implies that relevant documents in both collections are poorly-separated from non-relevant
documents; it also does not distinguish between the degree of separation between these
two collections. The Voorhees measure indicates that the relevant documents in the trec12
collection tend to be related to other relevant documents; this property is not as apparent in
the robust collection.

CHAPTER 4


AUTOCORRELATION  OF  RETRIEVAL  SCORES 


In  order  to  provide  evidence  for  the  Voorhees  hypothesis,  we  demonstrated  that  relevant 
documents  tended  to  be  situated  near  other  relevant  documents.  In  this  chapter,  we  will  be 
relaxing  the  Voorhees  hypothesis  to  make  a  statement  about  retrieval  scores.  We  will  be 
testing  the  following  hypothesis, 

Hypothesis  4.1.  Given  a  set  of  retrieval  scores,  the  local  consistency  of  the  scores  correlates  positively 
with  retrieval  performance. 

In  other  words,  a  good  retrieval  tends  to  score  topically-related  documents  consistently. 
We  will  develop  a  test  which  measures  the  degree  to  which  a  retrieval  system  exhibits  local 
consistency.  Like  the  tests  in  Chapter  3,  our  approach  will  use  the  inter-document  similarity 
matrix.  Unlike  these  tests,  we  will  use  a  vector  of  retrieval  scores,  y,  defined  in  Section  2.2 
instead  of  relevance  judgments. 

In this chapter, we will argue that local score consistency is an important predictor of
the performance of a set of retrieval scores. Our approach is similar to Fang et al.’s study of
heuristics used in information retrieval [Fang et al., 2004, p. 49; emphasis added]:

We formally define a set of basic desirable constraints that any reasonable retrieval formula
should satisfy, and check these constraints on a variety of retrieval formulas, which respectively
represent the vector space model (pivoted normalization), the classic probabilistic
retrieval model (Okapi), and the recently proposed language modeling
approach (Dirichlet prior smoothing). We find that none of these retrieval formulas
satisfies all the constraints unconditionally, though some formulas violate
more constraints or violate some constraints more “seriously” than others. Empirical
results show that when a constraint is not satisfied, it often indicates non-optimality of
the method, and when a constraint is satisfied only for a certain range of parameter values, its
performance tends to be poor when the parameter is out of the range. In general, we find that
the empirical performance of a retrieval formula is tightly related to how well it satisfies these
constraints. Thus the proposed constraints provide a good explanation of many
empirical observations about retrieval methods. Moreover, these constraints
make it possible to evaluate any existing or new retrieval formula analytically
and suggest how we may further improve a retrieval formula.

We will demonstrate that Hypothesis 4.1, like the heuristics studied by Fang et al., suggests a
property that retrieval systems should incorporate by design.




4.1 Testing the Cluster Hypothesis without Relevance Judgments

There is a growing body of work that studies the correlation of the performance of individual
retrievals with the degree to which a retrieval is clustered. In this section, we will
review two of these approaches.


4.1.1  Clarity 

Clarity measures the extent to which vocabulary is shared in the top n retrieved documents
[Cronen-Townsend et al., 2006]. The conjecture is that, in a good retrieval, the most
frequent words are topically coherent. A bad retrieval would include documents on many
disparate topics; the most frequent terms would be terminological noise.

Clarity  measures  the  similarity  of  the  most  frequent  words  in  retrieved  documents  to 
the  most  frequent  words  used  in  the  whole  corpus.  We  refer  to  the  frequency  of  terms  in 
the  whole  corpus  as  the  background  frequency.  The  frequent  terms  in  a  good  retrieval  will 
be  distinct  from  the  background;  the  frequent  terms  in  a  bad  retrieval  will  be  similar  to 
the  background.  In  the  context  of  language  modeling,  we  can  compute  a  representation  of 
the  language  used  in  the  initial  retrieval  as  a  weighted  combination  of  document  language 
models, 

$$ q = \frac{C^{\mathsf T} y}{\|y\|_1} \qquad (4.1) $$


where y represents the document query likelihoods (Equation 2.11). In order to model “general
text”, we use corpus-level statistics. The assumption here is that a language model of
the entire corpus will naturally converge on non-specific terminology.

$$ c = \frac{C^{\mathsf T} e}{\|C^{\mathsf T} e\|_1} \qquad (4.2) $$

We can compare q with c using any of the methods from Section 2.3.2. For example,
we can use the Kullback-Leibler divergence, $D_{\mathrm{KL}}(q \,\|\, c)$, or the Jensen-Shannon divergence,
$\mathrm{JS}(q, c)$.
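
The Clarity computation can be sketched in a few lines. The sketch makes several assumptions that are not stated in the text: the rows of the matrix C hold document language models over a shared vocabulary, the scores y are non-negative, and the corpus model is taken to be the L1-normalized sum of the rows of C. All function names are illustrative.

```python
import math


def query_model(C, y):
    """Score-weighted mixture of document language models (Equation 4.1)."""
    total = sum(abs(v) for v in y)
    n_terms = len(C[0])
    return [
        sum(C[d][w] * y[d] for d in range(len(C))) / total
        for w in range(n_terms)
    ]


def corpus_model(C):
    """Background model: L1-normalized column sums of C (assumed form)."""
    sums = [sum(row[w] for row in C) for w in range(len(C[0]))]
    total = sum(abs(s) for s in sums)
    return [s / total for s in sums]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); terms with p = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two documents over a two-term vocabulary.
C = [[0.8, 0.2], [0.2, 0.8]]
background = corpus_model(C)
focused = kl_divergence(query_model(C, [1.0, 0.0]), background)
flat = kl_divergence(query_model(C, [1.0, 1.0]), background)
```

A retrieval that concentrates its score mass on one document yields a query model that diverges from the background (`focused`), while a retrieval that scores every document equally reproduces the background and has zero divergence (`flat`).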

When retrievals are not based on language modeling, Equation 4.1 can be adjusted to be
a function of document ranks instead of scores [Cronen-Townsend et al., 2006]. Ranked-list
Clarity converts document ranks to $P(Q \mid \theta_i)$ values. This conversion begins by replacing all
of the scores in y with the respective ranks. Our estimation of $P(Q \mid \theta_i)$ from the ranks is,


$$ \tilde{y}_i = \begin{cases} \dfrac{2(c + 1 - y_i)}{c(c + 1)} & \text{if } y_i \leq c \\ 0 & \text{otherwise} \end{cases} \qquad (4.3) $$


where  c  is  a  cutoff  parameter.  We  estimate  the  query  language  model  using  Equation  4.1. 
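
A small sketch of the rank-to-weight conversion follows. It assumes ranks start at 1 and reads the condition in Equation 4.3 as "rank at most c", under which the weights of the top c documents sum to one; the function name is illustrative.

```python
def rank_weights(ranks, c):
    """Convert document ranks into pseudo query likelihoods.

    ranks -- list of ranks (1 = best), one per document
    c     -- cutoff; documents ranked below c receive zero weight
    """
    return [
        2.0 * (c + 1 - r) / (c * (c + 1)) if r <= c else 0.0
        for r in ranks
    ]


# Five retrieved documents with a cutoff of three.
weights = rank_weights([1, 2, 3, 4, 5], c=3)
```

The weights decrease linearly with rank, so the resulting vector can be used in place of y in Equation 4.1 without any further normalization.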


4.1.2  Cox-Lewis  Statistic 

Assume we have a set of documents retrieved for our query and $\mathcal{R}$ represents the indexes
of the top n documents. Another way to quantify the dispersion of a set of documents is to
inspect the similarity between documents in $\mathcal{R}$. In spatial data analysis, the Cox-Lewis
statistic measures the expected distance between a group of points. In the case of the set
$\mathcal{R}$, distances are computed in the m-dimensional embedding space. We hypothesize that
a good retrieval will return a single, tight cluster. A poorly performing retrieval will return
a loosely related set of documents covering many topics. The method of quantifying this
dispersion is to measure the distance from a random document a to its nearest neighbor, b.
A retrieval which is tightly clustered will, on average, have a low distance between a and b;
a retrieval which is less tightly clustered will, on average, have high distances between a and b.
This average corresponds to using the Cox-Lewis statistic to measure the randomness of the
top n documents retrieved from a system [Vinay et al., 2006]. It is important to notice that
the Cox-Lewis statistic throws away information about the retrieval function y. This makes
the Cox-Lewis statistic highly dependent on selecting the top n documents.
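
The dispersion measurement described above can be illustrated with a short sketch. It computes only the averaged nearest-neighbor distance from the description in this section; averaging over every document rather than over a random sample, and using Euclidean distance in the embedding space, are assumptions made for illustration.

```python
import math


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its nearest other point.

    points -- list of equal-length coordinate tuples, one per retrieved
              document in the embedding space
    """
    total = 0.0
    for i, a in enumerate(points):
        nearest = min(
            math.dist(a, b) for j, b in enumerate(points) if j != i
        )
        total += nearest
    return total / len(points)


# A tight set of documents and a dispersed one.
tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
loose = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]
tight_value = mean_nearest_neighbor_distance(tight)
loose_value = mean_nearest_neighbor_distance(loose)
```

Note that the scores play no role in this computation: only the choice of which documents enter `points` depends on the retrieval, which is the dependence on the top n cutoff discussed above.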


4.2  Autocorrelation 


In this section, we will be deriving a measure of local score consistency from the Voorhees
test. The Voorhees test computes the local precision of each relevant document. Let $r \in \{0, 1\}^n$
be the vector of relevance judgments and W be the nearest-neighbor matrix defined
in Section 2.3.3 where rows are $L_1$ normalized such that $We = e$. We can construct
the Voorhees histogram by computing the vector $Wr$ and then inspecting the entries corresponding
to relevant documents. The histogram used in the Voorhees test provides a nice
visualization of the distribution of local precision but can be summarized by a single number.
For example, we can compute the mean of the local precisions of relevant documents.
Noticing that $r^{\mathsf T} r$ represents the number of relevant documents, the mean can be computed
as

$$ \frac{r^{\mathsf T} W r}{r^{\mathsf T} r} = \frac{\sum_{i,j} W_{ij} r_i r_j}{\sum_i r_i^2} \qquad (4.4) $$

and is more generally referred to as the Rayleigh quotient in mathematics. We are interested
in measuring the similarity between the scores in the absence of relevance information. Our
approach will be to replace the binary vector, r, with the score vector y. Under the same
row normalization assumption as Equation 4.4, we obtain


$$ \frac{y^{\mathsf T} W y}{y^{\mathsf T} y} = \frac{\sum_{i,j} W_{ij} y_i y_j}{\sum_i y_i^2} \qquad (4.5) $$


which is referred to as the Moran autocorrelation in spatial data analysis [Cliff and Ord,
1973; Griffith, 2003].

For arbitrary y and fixed W, the Rayleigh quotient is bounded above by the largest
eigenvalue of W. However, recall that we will be computing a different W for each retrieval
using the top h documents. Therefore, the expected range of the value in Equation 4.5 is
dependent on the matrix W and vector y. This is problematic since we would like to compare
autocorrelation values for different retrievals. Therefore, we use the Cauchy-Schwarz
inequality to establish a bound on the autocorrelation,


$$ \frac{y^{\mathsf T} W y}{y^{\mathsf T} y} \leq \sqrt{\frac{y^{\mathsf T} W^{\mathsf T} W y}{y^{\mathsf T} y}} $$


Dividing  Equation  4.5  by  this  bound,  we  define  the  normalized  spatial  autocorrelation  as 


$$ I_M = \frac{y^{\mathsf T} W y}{\sqrt{y^{\mathsf T} y \times y^{\mathsf T} W^{\mathsf T} W y}} \qquad (4.6) $$


where we adopt the standard notation, $I_M$, for the Moran autocorrelation.
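
The normalized autocorrelation can be computed directly from its definition as the score-weighted agreement $y^{\mathsf T} W y$ divided by the Cauchy-Schwarz bound $\sqrt{y^{\mathsf T} y \times y^{\mathsf T} W^{\mathsf T} W y}$. The sketch below uses plain Python lists and illustrative names; it assumes W is row normalized and y is not the zero vector.

```python
import math


def normalized_autocorrelation(W, y):
    """Normalized Moran autocorrelation of scores y on neighbor matrix W.

    W -- row-normalized nearest-neighbor matrix (list of lists)
    y -- list of retrieval scores, one per document
    """
    # Wy holds, for each document, the weighted average score of its neighbors.
    Wy = [sum(w * v for w, v in zip(row, y)) for row in W]
    numerator = sum(a * b for a, b in zip(y, Wy))
    denominator = math.sqrt(sum(v * v for v in y) * sum(v * v for v in Wy))
    return numerator / denominator


# Four documents arranged in a chain; each row of W sums to one.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
smooth = normalized_autocorrelation(W, [3.0, 2.0, 1.0, 0.0])
noisy = normalized_autocorrelation(W, [3.0, 0.0, 2.0, 1.0])
```

Both score vectors contain the same values; only their arrangement on the graph differs. The arrangement in which neighboring documents receive similar scores yields the higher autocorrelation.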

In Section 2.3.3, we indicated that the nearest-neighbor matrix, W, could be visualized
as a graph. A set of scores, y, can be represented by coloring each node of the graph according
to its score. We present examples of document graphs with score-based coloring
in Figure 4.1. In order to accent the locality of scores, we also colored edges with a gradient
which transitions between the colors associated with the scores of the nodes on either
end of the edge. The graph of a retrieval with high autocorrelation (Figure 4.1a) consists
of edges with more solid colors, resulting in clear contrast between high-scoring and low-scoring
regions of the graph. The graph of a retrieval with low autocorrelation (Figure 4.1b)
consists of edges with sharper gradients, resulting in muddier contrast between high-scoring
and low-scoring regions of the graph.


4.3  Experiments 

Our experiments focus on testing the ability of autocorrelation to predict the performance
of a retrieval. As stated in the introduction of this chapter, we are interested in predicting
the performance of the retrieval generated by an arbitrary system. Our methodology
is consistent with previous research in that we predict the relative performance of a retrieval by
comparing a ranking based on our predictor to a ranking based on performance as measured
by average precision.

We  present  results  for  two  sets  of  experiments.  The  first  set  of  experiments  presents 
detailed  comparisons  of  our  predictors  to  previously-proposed  predictors  using  their  data 
sets.  Our  second  set  of  experiments  demonstrates  the  generalizability  of  our  approach  to 
arbitrary  retrieval  methods,  corpus  types,  and  corpus  languages. 

4.3.1  Detailed  Experiments 

In our detailed experiments, we will predict the performance of language modeling
scores using our autocorrelation predictor. We use retrievals, values for baseline predictors,
and evaluation measures reported in previous work [Zhou and Croft, 2006]. This will
allow us to compare the magnitude of our correlations with previously-published results.

These performance prediction experiments use language model retrievals performed for
queries associated with collections in the TREC corpora. Using TREC collections allows us
to confidently associate an average precision with a retrieval. In these experiments, we use
the following topic collections: TREC 4 ad hoc, TREC 5 ad hoc, Robust 2004, Terabyte
2004, and Terabyte 2005.

(a) a retrieval with high autocorrelation (“dismantling Europe’s arsenal”)

(b) a retrieval with low autocorrelation (“Export Controls Cryptography”)

Figure 4.1. Retrieval functions on the document graph. We constructed a nearest-neighbor
document graph for the top 1000 documents from a retrieval. Edges were colored by a
gradient based on the relevance of each connected document. High retrieval scores are
associated with red. Low retrieval scores are associated with grey.

We provide two baselines. Our first baseline is the classic Clarity predictor presented
in Equation 4.1. Clarity is the theoretically-appropriate predictor for language modeling
systems. Our second baseline is Zhou and Croft’s “ranking robustness” predictor [Zhou and
Croft, 2006]. This predictor corrupts the top k documents from a retrieval and re-computes
the language model scores for these corrupted documents. The value of the predictor is the
Spearman rank correlation between the original ranking and the corrupted ranking. In our
tables, we will label results for Clarity using $D^V_{KL}$ and the ranking robustness predictor
using $P$.

In addition to single-predictor experiments, we experimented with the linear combination
of predictors. We optimized the linear regression using the square root of each predictor.
We found that this substantially improved fits for all predictors, including the baselines. We
considered linear combinations of pairs of predictors (labeled by the components) and all
predictors (labeled as $\beta$).

4.3.2  Generalizability  Experiments 

Autocorrelation  does  not  require  a  particular  baseline  retrieval  system;  the  predictor 
can  be  computed  for  an  arbitrary  retrieval,  regardless  of  how  scores  were  generated.  In  a 
second  set  of  experiments,  we  demonstrate  the  generalizability  of  our  results  to  a  variety  of 
collections,  topics,  and  retrieval  systems. 

We gathered a diverse set of collections from TREC corpora. We cast a wide net in order
to locate collections where our predictors might fail. Our hypothesis is that topically-related
documents should have similar scores. Therefore, we avoided collections where scores were
unlikely to be correlated (e.g., question-answering) or were likely to be negatively correlated
(e.g., diverse ranking). Nevertheless, our collections include corpora where correlations are
weakly justified (e.g., non-English corpora) or not justified at all (e.g., expert search). Details
of these corpora and runs can be found in Appendix A.

4.3.3  Evaluation 

Given a set of retrievals, potentially from a combination of queries and systems, we measure
the correlation of the rank ordering of this set by the predictor and by the performance
metric. In order to ensure comparability with previous results, we present the correlation
between the predictor’s ranking and the ranking based on average precision of the retrieval. We
present results for Kendall’s $\tau$, Spearman’s $\rho$, and Pearson’s $r$. Unless explicitly noted, all
correlations are significant with $p < 0.05$.
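
As a rough illustration of the rank-based evaluation described above, the following sketch computes Kendall's $\tau$ between a predictor and average precision over a set of retrievals. It uses the simple tau-a variant, which ignores ties; the text does not say which variant was used, so this is an assumption, and the function name `kendall_tau` and the five data points are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1  # tied pairs count toward neither total
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical predictor values and average precisions for five retrievals.
predictor = [0.62, 0.15, 0.48, 0.30, 0.71]
average_precision = [0.41, 0.05, 0.22, 0.27, 0.55]
tau = kendall_tau(predictor, average_precision)  # 9 concordant, 1 discordant pair
```

A value near 1 means the predictor orders the retrievals almost exactly as average precision does; a value near 0 means the two orderings are unrelated.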

Predictors can sometimes perform better when linearly combined [Diaz and Jones, 2004;
He and Ounis, 2004]. Although previous work has presented the coefficient of determination
($R^2$) to measure the quality of the regression, this measure cannot be reliably used
when comparing slight improvements from combining predictors. Therefore, we adopt
the adjusted coefficient of determination which penalizes models with more variables. The
adjusted $R^2$ allows us to evaluate the improvement in prediction achieved by adding a
parameter but loses the statistical interpretation of $R^2$. We, therefore, will use the correlation
coefficients to evaluate the magnitude of the correlation and the adjusted $R^2$ to evaluate the
combination of variables.
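
To make the penalty concrete, here is a minimal sketch of the adjusted coefficient of determination. The text does not spell out the formula, so the standard definition, $1 - (1 - R^2)(n - 1)/(n - p - 1)$ for $n$ samples and $p$ predictors, is assumed; the function names and the example numbers are hypothetical.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Standard adjustment: penalize R^2 by the number of predictors used."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical fits over 50 queries: two extra predictors raise R^2 only slightly.
single = adjusted_r_squared(0.50, n_samples=50, n_predictors=1)
combined = adjusted_r_squared(0.51, n_samples=50, n_predictors=3)
```

In this made-up case the combined model has the higher raw $R^2$ but the lower adjusted $R^2$, which is the kind of slight improvement the adjusted measure is meant to discount.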

4.4  Results 

We present results for our detailed experiments comparing the prediction of language
model scores in Table 4.1. Although the Clarity measure is theoretically designed for language
model scores, it underperforms our system-agnostic predictor on every collection except
robust04. The ranking robustness measure, developed by Zhou and Croft to improve Clarity’s
performance on web collections (i.e., terabyte04, terabyte05), does improve the $\tau$ correlation
from 0.139 to 0.150 for terabyte04 and 0.171 to 0.208 for terabyte05. However, these
improvements are slight compared to the performance of autocorrelation on these collections.
Our predictor achieves a $\tau$ correlation of 0.454 for terabyte04 and 0.383 for terabyte05.
Though not always the strongest, autocorrelation achieves correlations competitive with
baseline predictors. When examining the performance of linear combinations of predictors
(Table 4.2), we note that in every case, autocorrelation factors as a necessary component of a
strong predictor. We also note that the adjusted $R^2$ values for individual baselines are always
improved by incorporating autocorrelation.

We  present  our  generalizability  results  in  Table  4.3.  For  every  collection  except  one,  we 
achieve  better  correlations  than  ranked-list  Clarity.  Surprisingly,  we  achieve  relatively  strong 
correlations  for  Spanish  and  Chinese  collections  despite  our  naive  processing.  We  do  not 
have a ranked-list Clarity correlation for ent05 because we did not have a clear method for
building a query-independent language model for an entity. Our autocorrelation measure
also does not achieve high correlations on ent05, perhaps because relevance for entity
retrieval does not propagate according to the cooccurrence links we use.

As noted above, the poor Clarity performance on web data is consistent with our findings
in the detailed experiments. Clarity also notably underperforms for several news corpora
(trec5, trec7, and robust04). On the other hand, autocorrelation seems robust to the changes
between different corpora.

4.5  Discussion 

The results presented in Section 4.4 provide evidence for Hypothesis 4.1.
Our experiments demonstrate that a failure to respect local consistency correlates with
poor performance. Why might systems fail to score topically-related documents consistently?
Query-based information retrieval systems often score documents independently.
All of the retrieval models in Chapter 2 score documents independently. That is, the score
of document a may be computed by examining query term or phrase matches, the document
length, and perhaps global collection statistics. Once computed, though, a system
rarely compares the score of a to the score of a topically-related document b. Our results
demonstrate that this lack of attention to local score consistency can hurt performance.

(a) Kendall’s $\tau$

              $D^V_{KL}$   $P$     $I_M$
trec4         0.353        0.548   0.513
trec5         0.311        0.329   0.357
robust04      0.418        0.398   0.373
terabyte04    0.139        0.150   0.454
terabyte05    0.171        0.208   0.383

(b) Spearman’s $\rho$

              $D^V_{KL}$   $P$     $I_M$
trec4         0.507        0.738   0.674
trec5         0.447        0.475   0.498
robust04      0.590        0.567   0.543
terabyte04    0.193        0.221   0.583
terabyte05    0.246        0.307   0.522

(c) Pearson’s $r$

              $D^V_{KL}$   $P$     $I_M$
trec4         0.430        0.613   0.645
trec5         0.366        0.454   0.538
robust04      0.509        0.554   0.349
terabyte04    0.305        0.341   0.598
terabyte05    0.206        0.301   0.539

Table 4.1. Comparison of autocorrelation to Robustness and Clarity measures for language
model scores. Evaluation replicates experiments from [Zhou and Croft, 2006]. We present
correlations between the classic Clarity measure ($D^V_{KL}$), the ranking robustness measure ($P$),
and autocorrelation ($I_M$) each with mean average precision. Measures in bold represent the
strongest correlation for that test/collection pair.


              $D^V_{KL}$   $P$     $I_M$   $D^V_{KL}, P$   $D^V_{KL}, I_M$   $P, I_M$   $\beta$
trec4         0.168        0.363   0.422   0.466           0.420             0.557      0.553
trec5         0.116        0.190   0.236   0.238           0.244             0.266      0.269
robust04      0.256        0.304   0.278   0.403           0.373             0.402      0.442
terabyte04    0.059        0.045   0.292   0.076           0.293             0.289      0.284
terabyte05    0.022        0.072   0.193   0.120           0.225             0.218      0.257

Table 4.2. Combination of autocorrelation, ranking robustness, and Clarity measures for
language model scores. The adjusted coefficient of determination is presented to measure
the effectiveness of individual predictors, pairwise combinations of predictors, and the
combination of all predictors ($\beta$). Measures in bold represent the strongest correlation for that
test/collection pair.

                 ranked-list Clarity   $I_M$
trec3            0.201                 0.461
trec4            0.252                 0.396
trec5            0.016                 0.277
trec6            0.230                 0.227
trec7            0.083                 0.326
trec8            0.235                 0.396
robust03         0.302                 0.354
robust04         0.183                 0.308
robust05         0.224                 0.249
terabyte04       0.043                 0.245
terabyte05       0.068                 0.306
trec4-spanish    0.307                 0.388
trec5-spanish    0.220                 0.458
trec5-chinese    0.092                 0.199
trec6-chinese    0.144                 0.276
ent05            -                     0.181

Table 4.3. Predicting the ranking of large sets of retrievals for various collections and
retrieval systems. Kendall’s $\tau$ correlations are computed between the predicted ranking and a
ranking based on the retrieval’s average precision. Measures in bold represent the strongest
correlation for that test/collection pair.




In Equation 4.6, we presented the Moran autocorrelation measuring the local consistency.
We can rewrite this equation as the correlation between two vectors,

$$I_M = \frac{\tilde{\mathbf{y}}^T \mathbf{y}}{\|\tilde{\mathbf{y}}\|_2 \, \|\mathbf{y}\|_2} \tag{4.7}$$

where $\tilde{\mathbf{y}} = \mathbf{W}\mathbf{y}$. This implies that a vector of scores, $\mathbf{y}$, has high autocorrelation if it is
correlated with the vector $\tilde{\mathbf{y}}$. This vector, $\tilde{\mathbf{y}}$, can be interpreted as the original set of retrieval
scores “diffused” over the adjacency graph, $\mathbf{W}$. From another perspective, the vector $\tilde{\mathbf{y}}$
might represent a high quality vector of scores which serves as a surrogate for the relevance
vector, $\mathbf{r}$. The greater the correlation with this high quality surrogate, the better the retrieval.
If we treat the vector $\tilde{\mathbf{y}}$ as a high quality surrogate, then we can replace it with a set of scores
which we know to be very good. For example, the combination of scores from multiple
systems often results in very good retrieval performance [Montague and Aslam,
2001]. We can treat the correlation between the retrieval, $\mathbf{y}$, and the combined scores as
a predictor of performance. Assume that we are given $m$ retrievals, $\mathbf{y}_i$, for the same $n$
documents. We will represent the mean of these vectors as,


$$\mathbf{y}_\mu = \frac{1}{m} \sum_{i=1}^{m} \mathbf{y}_i \tag{4.8}$$

We use the mean vector as an approximation to relevance. Because $\mathbf{y}_\mu$ represents a very good
retrieval, we hypothesize that a strong similarity between $\mathbf{y}_\mu$ and $\mathbf{y}$ will correlate positively
with system performance. We use Pearson’s product-moment correlation to measure the
similarity between these vectors,

$$\rho(\mathbf{y}, \mathbf{y}_\mu) = \frac{\mathbf{y}^T \mathbf{y}_\mu}{\|\mathbf{y}\|_2 \, \|\mathbf{y}_\mu\|_2} \tag{4.9}$$

Note the similarity between Equations 4.7 and 4.9. A form of this type of precision prediction
was proposed by Aslam and Pavlu for ranking queries according to difficulty as opposed to
ranking retrievals according to performance [Aslam and Pavlu, 2007].
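
The two predictors in Equations 4.7 and 4.9 can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the experiments: scores are standardized to zero mean and unit variance before comparison, the graph `W` is a dense list of lists, and all function names and example data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def standardize(scores):
    """Shift and scale scores to zero mean and unit (population) variance."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0.0:
        raise ValueError("constant score vector cannot be standardized")
    return [(s - mean) / std for s in scores]


def normalized_inner_product(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def autocorrelation(W, scores):
    """Equation 4.7: similarity between y and the diffused scores W y."""
    y = standardize(scores)
    y_diffused = [dot(row, y) for row in W]
    return normalized_inner_product(y_diffused, y)


def similarity_to_mean(scores, retrievals):
    """Equation 4.9: similarity between y and the mean of m score vectors."""
    standardized = [standardize(r) for r in retrievals]
    y_mu = [sum(column) / len(standardized) for column in zip(*standardized)]
    return normalized_inner_product(standardize(scores), y_mu)


# Hypothetical chain graph over four documents (0-1-2-3) and two score vectors.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
smooth = [4.0, 3.0, 2.0, 1.0]   # neighbors receive similar scores
rough = [4.0, 1.0, 3.0, 2.0]    # neighbors receive dissimilar scores
```

On this toy graph, `autocorrelation(W, smooth)` is positive while `autocorrelation(W, rough)` is negative, which matches the intuition that locally consistent scores correlate with their own diffusion.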

The retrievals contained in the TREC data consist of multiple score vectors for each
query. Therefore, the data from our generalizability experiments allows us to measure the
effect of replacing $\tilde{\mathbf{y}}$ with $\mathbf{y}_\mu$. We present the results in Table 4.4. In almost every collection, a
retrieval’s similarity to the combined scores, $\mathbf{y}_\mu$, is more highly correlated with performance
than its autocorrelation.

We believe that autocorrelation is, like multiple-retrieval algorithms, approximating a
good ranking; in this case by diffusing scores. If $\tilde{\mathbf{y}}$ is a reasonable surrogate, then
score diffusion will tend, in general, to improve performance. Our results demonstrate that this
approximation is not as powerful as information from multiple retrievals. Nevertheless, in
situations where this extra information is lacking, perhaps we can develop techniques to use
information from topically-related documents to systematically improve retrieval scores in
a system-agnostic manner.


                 $I_M$    $\rho(\mathbf{y}, \mathbf{y}_\mu)$
trec3            0.461    0.439
trec4            0.396    0.482
trec5            0.277    0.459
trec6            0.227    0.428
trec7            0.326    0.430
trec8            0.396    0.508
robust03         0.354    0.385
robust04         0.308    0.384
robust05         0.249    0.377
terabyte04       0.245    0.420
terabyte05       0.306    0.434
trec4-spanish    0.388    0.398
trec5-spanish    0.458    0.484
trec5-chinese    0.199    0.379
trec6-chinese    0.276    0.353
ent05            0.181    0.305

Table 4.4. Using a higher quality surrogate. We compare the predictiveness of autocorrelation
to that of the correlation of $\mathbf{y}$ with interpolated scores from alternate retrievals. We consider
the interpolation vector $\mathbf{y}_\mu$ to be a high quality surrogate for relevance.

4.6 Summary

In this chapter, we demonstrated a correlation between retrieval performance and local
score consistency. This correlation is comparable with other performance predictors in the
literature. Local consistency exhibits this correlation across a diverse set of retrieval methods
and corpora. We believe that one of the explanations for this correlation is the systemic
absence of local consistency as a design principle. Based on an informal analysis, we also
believe that our predictor suggests a possible solution to this lack of local consistency. In the
next part of the thesis, we will develop this idea into a robust and general solution to local
score inconsistency.

PART II


REGULARIZATION  OF  RETRIEVAL  SCORES 


CHAPTER  5 


LOCAL  SCORE  REGULARIZATION 


There  is  a  correlation  between  local  score  consistency  and  retrieval  performance.  But  a 
correlation  alone  only  suggests  exploring  local  consistency  as  a  system  design  principle.  In 
this  chapter,  we  propose  the  following  causal  hypothesis, 

Hypothesis  5.1.  Given  a  set  of  retrieval  scores,  increasing  the  local  consistency  of  the  scores  improves 
retrieval  performance. 

We will test this hypothesis by defining an optimization problem whose objective function
maximizes local consistency.¹

We treat a retrieval as a mapping or function from documents in the collection to a real
value. For example, all of the algorithms from Chapter 2 provide a score for each document.
Mathematically, given a query, $q$, a retrieval model provides a function, $f_q : \mathcal{D} \rightarrow \Re$,
from documents to scores; we refer to $f_q$ as the initial score function for a particular query. The
argument of this function is the retrieval system’s representation of a document. The values
of the function induce a ranking. Notice that we index functions by the query. We do this
to emphasize the fact that, in information retrieval, the score function over all documents
will be different for each query. Although we drop the index for notational convenience, the
reader should keep in mind that this is a function for a particular query. In this chapter, we
will examine the behavior of score functions for ranked retrieval models with respect to the
geometry of the underlying domain, $\mathcal{D}$.

One way to describe a function, regardless of its domain, is by its smoothness. The smoothness
of a function might be measured, for example, by its continuity, as in Lipschitz continuity.
In many situations, we prefer functions which exhibit higher smoothness. For example,
consider the one-dimensional functions in Figure 5.1. If we assume that local consistency
or continuity in the function is desirable, then the function depicted in Figure 5.1b is
preferable because it is smoother.

If only presented with the function in Figure 5.1a, we can procedurally modify the function
to better satisfy our preference for smooth functions. The result may be the function
in Figure 5.1b. Post-processing a function is one way to perform regularization [Chen and
Haykin, 2002]. In our work, we regularize initial score functions. Because our analysis and
regularization is local to the highest scored documents, we refer to this process as local score
regularization.

¹ Balinski and Danilowicz recently proposed a similar score-based objective [Balinski and Danilowicz,
2005]. Though a solution is presented, we are not aware of any experimental results.

(a) Unregularized   (b) Regularized

Figure 5.1. Functions in one dimension. Each value on the horizontal axis may, for example,
represent a one-dimensional classification code such as a linear library ordering of
books. The functions in these figures assign a value to each point on the real line and may
represent relevance. If a set of functions are intended to describe the same phenomenon
or signal, we can develop criteria for preferring one function over another. If we prefer
smoother functions, we would dismiss the function in (a) in favor of the function in (b). The
process of smoothing the function in (a) into the function in (b) is a type of regularization.

When our domain was the real line, we wanted the value of the function at two points,
$f(x_1)$ and $f(x_2)$, to be similar if the distance between the two points, $|x_1 - x_2|$, was small. In
information retrieval, our domain is the set of documents and we want the value of the function
for two documents to be similar if the “distance between two documents” is small. We
adopt a topic-based distance and consider two documents close if they are topically-related.
We will refer to this topical relationship as topical affinity. Affinity between documents can
be measured using techniques from Section 2.3. We would like two documents which share
the same topic to receive similar scores. We depict this graphically in Figure 5.2a for documents
in a two-dimensional embedding space. When presented with a query, the retrieval
system computes scores for each document in this space (Figure 5.2b); this is our initial score
function. We regularize a function in order to improve the consistency of scores between
neighboring documents. This is depicted graphically in Figure 5.2c where the value of the
function is smoother in the document space. Of course, realistic collections often cannot be
visualized like this two-dimensional example. Nevertheless, the fundamental regularization
process remains roughly the same.

Figure 5.2. Regularizing retrieval scores. Documents in a collection can often be embedded
in a vector space as shown in (a). When presented with a query, a retrieval system
provides scores for all of the documents in the collection (b). Score regularization refers to
the process of smoothing out the retrieval function such that neighboring documents receive
similar scores (c).

5.1 Problem Statement

We now formally define the test of Hypothesis 5.1. The input is a vector of document
scores. Although the system usually scores all $n$ documents in the collection, we consider
only the top $\tilde{n}$ scores. The $\tilde{n} \times 1$ vector, $\mathbf{y}$, represents these scores. This vector may be
normalized if desired. For example, we normalize this vector to have zero mean and unit
variance. The output is the vector of regularized scores represented by the $\tilde{n} \times 1$ vector $\mathbf{f}$.
The objective of the regularization process is to improve the local consistency of the scores.
If the ranking induced by $\mathbf{f}$ results in performance superior to the ranking induced by $\mathbf{y}$,
then we claim to have evidence for Hypothesis 5.1. We will measure performance using
mean average precision which provides a standard and stable evaluation metric [Buckley
and Voorhees, 2000].
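
The normalization mentioned above (zero mean, unit variance) is straightforward to sketch. The function name `standardize` and the example scores are hypothetical, and the population variance is assumed since the text does not specify the estimator.

```python
import math


def standardize(scores):
    """Return scores shifted to zero mean and scaled to unit variance."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n  # population variance
    if variance == 0.0:
        raise ValueError("all scores are identical; variance is zero")
    std = math.sqrt(variance)
    return [(s - mean) / std for s in scores]


# Hypothetical query-likelihood log scores for the top four documents.
raw_scores = [-4.1, -4.6, -5.0, -6.3]
y = standardize(raw_scores)
```

Because the transformation is a positive affine map, it leaves the ranking induced by the scores unchanged; it only puts scores from different systems on a comparable scale.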

5.2 Local Score Regularization²

Given the initial scores as a vector, $\mathbf{y}$, we would like to compute a set of regularized
scores, $\mathbf{f}$, for these same documents. To accomplish this, we use two contending objectives:
score consistency between related documents and score consistency with the initial retrieval.
These two objectives are depicted graphically for a one-dimensional function in Figure 5.3.
Let $\mathcal{S}(\mathbf{f})$ be a cost function associated with the inter-document consistency of the scores, $\mathbf{f}$;
if related documents have very inconsistent scores, then the value of this function will be
high. Let $\mathcal{E}(\mathbf{f}, \mathbf{y})$ be a cost function measuring the consistency with the original scores; if
documents have scores very inconsistent with their original scores, then the value of this
function will be high. For mathematical simplicity, we use a linear combination of these
objectives for our composite objective function,

$$\mathcal{Q}(\mathbf{f}, \mathbf{y}) = \mathcal{S}(\mathbf{f}) + \mu \, \mathcal{E}(\mathbf{f}, \mathbf{y}) \tag{5.1}$$

where $\mu$ is a parameter allowing us to control how much weight to place on inter-document
smoothing versus consistency with the original score.³

5.2.1  Measuring  Inter-Document  Consistency 

Inter-document relatedness is represented by the graph, $\mathbf{W}$, defined in Section 2.3.3
where $W_{ij}$ represents the affinity between documents $i$ and $j$. We define our graph so that
there are no self-loops ($W_{ii} = 0$). A set of scores is considered smooth if related documents
have similar scores. In order to quantify smoothness, we define the cost function, $\mathcal{S}(\mathbf{f})$,
which penalizes inconsistency between related documents,

$$\mathcal{S}(\mathbf{f}) = \sum_{i,j=1}^{\tilde{n}} W_{ij} (f_i - f_j)^2 \tag{5.2}$$

We measure inconsistency using the weighted difference between scores of neighboring
documents.⁴

² We present a regularization method which applies previous results from machine learning [Zhou et al.,
2004]. We will review these results in the vocabulary of information retrieval. More thorough derivations can
be found in cited publications.

³ These functions operate on the entire vector $\mathbf{f}$ as opposed to element-wise.

(a) Smoothness Constraint   (b) Error Constraint

Figure 5.3. Smoothness and error constraints for a function on a linear graph. In Figure a,
the smoothness constraint penalizes functions where neighboring nodes in $\mathbf{f}$ receive different
values. In Figure b, the error constraint penalizes functions where nodes in $\mathbf{f}$ receive values
different from the corresponding values in $\mathbf{y}$.

The constraint in Equation 5.2 bears a close relationship to the Moran autocorrelation
in Equation 4.5. We can make the relationship clear by rearranging terms in Equation 5.2,

$$\sum_{i,j=1}^{\tilde{n}} W_{ij} (f_i - f_j)^2 = 2 \sum_{i=1}^{\tilde{n}} d_i f_i^2 - 2 \sum_{i,j=1}^{\tilde{n}} W_{ij} f_i f_j \tag{5.3}$$

where $d_i = \sum_{j=1}^{\tilde{n}} W_{ij}$. The first term in the constraint provides a weighted L2 regularization
of the solution while the second term penalizes solutions with low autocorrelation.
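
The rearrangement in Equation 5.3 is easy to check numerically. The sketch below evaluates both sides on a small graph; it assumes a symmetric affinity matrix with no self-loops, and the function names and example values are hypothetical.

```python
def dirichlet_sum(W, f):
    """Left-hand side of Equation 5.3: sum over all ordered pairs (i, j)."""
    n = len(f)
    return sum(W[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))


def rearranged_sum(W, f):
    """Right-hand side: weighted L2 term minus the autocorrelation-like term."""
    n = len(f)
    d = [sum(row) for row in W]  # d_i = sum_j W_ij
    weighted_l2 = sum(d[i] * f[i] ** 2 for i in range(n))
    cross_term = sum(W[i][j] * f[i] * f[j] for i in range(n) for j in range(n))
    return 2.0 * weighted_l2 - 2.0 * cross_term


# Hypothetical symmetric affinity graph over three documents.
W = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
f = [1.0, 0.0, -2.0]
lhs = dirichlet_sum(W, f)
rhs = rearranged_sum(W, f)
```

Both sides evaluate to 11 here. The cross term is large and positive when strongly connected documents have scores of the same sign, so minimizing the sum rewards high autocorrelation.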

In spectral graph theory, Equation 5.2 is known as the Dirichlet sum [Chung, 1997].
We can rewrite the Dirichlet sum in matrix notation,

$$\sum_{i,j=1}^{\tilde{n}} W_{ij} (f_i - f_j)^2 = \mathbf{f}^T (\mathbf{D} - \mathbf{W}) \mathbf{f} \tag{5.4}$$

where $\mathbf{D}$ is the diagonal matrix defined as $D_{ii} = d_i$. The matrix $(\mathbf{D} - \mathbf{W})$ is known as the
combinatorial Laplacian which we represent by $\Delta_C$. The graph Laplacian can be viewed as
the discrete analog of the Laplace-Beltrami operator. Because the Laplacian can be used to
compute the smoothness of a function, we may abstract $\Delta_C$ and replace it with alternative
formulations of the Laplacian which offer alternative measures of smoothness. For example,
the normalized Laplacian is defined as,

$$\Delta_N = \mathbf{D}^{-1/2} \Delta_C \mathbf{D}^{-1/2} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2} \tag{5.5}$$

which measures the degree-normalized smoothness as,

$$\mathbf{f}^T \Delta_N \mathbf{f} = \sum_{i,j=1}^{\tilde{n}} W_{ij} \left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^2 \tag{5.6}$$

⁴ The local, discrete Lipschitz constant for a document, $i$, can be thought of as $\max_j (W_{ij} \|f_i - f_j\|)$.
Although similar, the local Lipschitz measure is much less forgiving to discontinuities in a function. Because
our retrieval function can be thought of as a very peaked or spiky function due to the paucity of relevant
documents, we adopt the Laplacian-based measure.


The approximate Laplace-Beltrami operator is a variation of the normalized Laplacian which
uses a modified affinity matrix [Lafon, 2004]. The approximate Laplace-Beltrami operator
is defined as,

$$\Delta_A = \mathbf{I} - \hat{\mathbf{D}}^{-1/2} \hat{\mathbf{W}} \hat{\mathbf{D}}^{-1/2} \tag{5.7}$$

where we use the adjusted affinity matrix $\hat{\mathbf{W}} = \mathbf{D}^{-1} \mathbf{W} \mathbf{D}^{-1}$ with $\hat{D}_{ii} = \sum_{j=1}^{\tilde{n}} \hat{W}_{ij}$. The
approximate Laplace-Beltrami operator theoretically addresses violations of the uniform
sampling assumption. Because the graph $\mathbf{W}$ will be built from a biased sample, we adopt
the approximate Laplace-Beltrami operator (Equation 5.7) in our work. We examine the
effect of this choice on the regularization performance in Section 5.4.1.
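
The three Laplacians discussed in this section can be sketched directly from their definitions. This is a minimal dense-matrix illustration, assuming a symmetric affinity matrix in which every document has at least one neighbor (so all degrees are positive); the function names and the example graph are hypothetical.

```python
import math


def degrees(W):
    return [sum(row) for row in W]


def combinatorial_laplacian(W):
    """Delta_C = D - W."""
    d, n = degrees(W), len(W)
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]


def normalized_laplacian(W):
    """Delta_N = I - D^{-1/2} W D^{-1/2}."""
    d, n = degrees(W), len(W)
    return [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(d[i] * d[j])
             for j in range(n)] for i in range(n)]


def approximate_laplace_beltrami(W):
    """Delta_A: normalized Laplacian of the adjusted affinity D^{-1} W D^{-1}."""
    d, n = degrees(W), len(W)
    W_hat = [[W[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]
    return normalized_laplacian(W_hat)


def quadratic_form(L, f):
    """Smoothness cost f^T L f for a Laplacian L."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))


# Hypothetical symmetric affinity graph over three documents.
W = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
f = [1.0, 0.0, -2.0]
```

One detail worth noting: the double sum in Equation 5.2 runs over ordered pairs and so counts each edge twice. On this example it evaluates to 11 while `quadratic_form(combinatorial_laplacian(W), f)` gives 5.5. The constant factor does not change which score vectors are considered smoother.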

In Section 2.3.3, we argued that visualization based on the eigenvectors of the Laplacian
was not suitable because it ignored subtle aspects of the graph. The use of
the Laplacian in this section, therefore, may seem ill-founded. However, we point out that
it is only the embedding process (that is, taking the bottom two eigenvectors or, equivalently,
the low frequency harmonics) which makes the visualization globally biased. In this
section, we do not perform any eigendecomposition and therefore the Laplacian captures
local as well as global behavior.

The value of the objective, $\mathcal{S}(\mathbf{f})$, is small for smooth functions and large for non-smooth
functions. Unconstrained, however, the function minimizing this objective is the constant
function,

$$\arg\min_{\mathbf{f}} \mathcal{S}(\mathbf{f}) = \mathbf{e}$$

In the next section, we will define a second objective which penalizes regularized scores
inordinately inconsistent with the initial retrieval.

5.2.2  Measuring  Consistency  with  Initial  Scores 

We define a second objective, $\mathcal{E}(\mathbf{f}, \mathbf{y})$, which penalizes inconsistencies between the initial
retrieval scores, $\mathbf{y}$, and the regularized scores, $\mathbf{f}$,

$$\mathcal{E}(\mathbf{f}, \mathbf{y}) = \sum_{i=1}^{\tilde{n}} (f_i - y_i)^2 \tag{5.8}$$

The regularized scores, $\mathbf{f}$, minimizing this function would be completely consistent with the
original scores, $\mathbf{y}$; that is, if we only minimize this objective, then the solution is $\mathbf{f} = \mathbf{y}$.

5.2.3 Minimizing the Objective Function

In the previous two sections, we defined two constraints, $\mathcal{S}(\mathbf{f})$ and $\mathcal{E}(\mathbf{f}, \mathbf{y})$, which can
be combined as a single objective, $\mathcal{Q}(\mathbf{f}, \mathbf{y})$. Formally, we would like to find the optimal set of
regularized scores, $\mathbf{f}^*$, such that,

$$\mathbf{f}^* = \arg\min_{\mathbf{f} \in \Re^{\tilde{n}}} \mathcal{Q}(\mathbf{f}, \mathbf{y}) \tag{5.9}$$

In this section, we will describe two solutions, one iterative and one closed-form, to compute
the regularized scores $\mathbf{f}^*$.

Our iterative solution to this optimization interpolates the score of a document with the
scores of its neighbors. Metaphorically, this process, at each iteration, diffuses scores on the
document graph. This is accomplished mathematically by defining a diffusion operator, $\mathbf{S}$,
for each Laplacian.

Laplacian     $\mathbf{S}$
$\Delta_C$    $\mathbf{W}$
$\Delta_N$    $\mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$
$\Delta_A$    $\hat{\mathbf{D}}^{-1/2} \hat{\mathbf{W}} \hat{\mathbf{D}}^{-1/2}$

Given this operator, the score diffusion process can be formulated as,

$$\mathbf{f}^{t+1} = (1 - \alpha) \mathbf{y} + \alpha \mathbf{S} \mathbf{f}^{t} \tag{5.10}$$

where $\alpha = \frac{1}{1 + \mu}$ [Zhou et al., 2004]. We can initialize the regularized scores such that $\mathbf{f}^0 = \mathbf{y}$.
As $t$ approaches $\infty$, the regularized scores, $\mathbf{f}^t$, converge on the optimal scores, $\mathbf{f}^*$. Because we
build our graph using a nearest-neighbor technique, this solution also has a close relationship
to nonparametric regression [Cover, 1968; Devroye, 1978]. In particular, it is an iterated
nearest-neighbor regression in the ambient space. The iterative diffusion in Equation 5.10
provides an intuition for the solution to our optimization.
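
The diffusion process in Equation 5.10 can be sketched as a short loop. This is an illustration only: a fixed iteration count stands in for a proper convergence test, the operator is the normalized one ($\mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$) on a hypothetical three-document chain, and the function name `diffuse_scores` is an assumption.

```python
import math


def diffuse_scores(S, y, alpha, iterations=200):
    """Iterate f <- (1 - alpha) * y + alpha * S f, starting from f = y."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    f = list(y)
    for _ in range(iterations):
        Sf = [sum(s * v for s, v in zip(row, f)) for row in S]
        f = [(1.0 - alpha) * y_i + alpha * sf_i for y_i, sf_i in zip(y, Sf)]
    return f


# Hypothetical three-document chain (0-1-2) with S = D^{-1/2} W D^{-1/2}.
s = 1.0 / math.sqrt(2.0)
S = [[0.0, s, 0.0], [s, 0.0, s], [0.0, s, 0.0]]
y = [1.0, 0.0, 0.0]  # only document 0 matched the query initially
f = diffuse_scores(S, y, alpha=0.5)
```

After diffusion, document 1 (a neighbor of the only initially scored document) receives a positive score, and document 2 receives a smaller one. Larger values of $\alpha$ spread the initial score further along the graph.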

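We can also derive a closed form solution to Equation 5.9. We begin by taking the
derivative of $\mathcal{Q}(\mathbf{f}, \mathbf{y})$ with respect to $\mathbf{f}$,

$$\frac{\partial}{\partial \mathbf{f}} \mathcal{Q}(\mathbf{f}, \mathbf{y}) = \Delta \mathbf{f} + \mu (\mathbf{f} - \mathbf{y})$$

Setting this equal to zero,

$$\begin{aligned}
\Delta \mathbf{f}^* + \mu (\mathbf{f}^* - \mathbf{y}) &= 0 \\
\alpha \Delta \mathbf{f}^* + (1 - \alpha) \mathbf{f}^* - (1 - \alpha) \mathbf{y} &= 0 \\
(\alpha \Delta + (1 - \alpha) \mathbf{I}) \mathbf{f}^* &= (1 - \alpha) \mathbf{y} \\
\mathbf{f}^* &= (1 - \alpha) (\alpha \Delta + (1 - \alpha) \mathbf{I})^{-1} \mathbf{y}
\end{aligned} \tag{5.11}$$

where $\alpha$ is defined above. In our work, we will be using this closed form solution.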
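
The closed form in Equation 5.11 can be sketched without an explicit matrix inverse by solving the linear system $(\alpha \Delta + (1 - \alpha) \mathbf{I}) \mathbf{f}^* = (1 - \alpha) \mathbf{y}$. The small Gaussian-elimination helper, the function names, and the example Laplacian below are illustrative assumptions; a numerical library would normally be used for realistic sizes.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def regularize(laplacian, y, alpha):
    """Return f* solving (alpha * L + (1 - alpha) * I) f* = (1 - alpha) * y."""
    n = len(y)
    A = [[alpha * laplacian[i][j] + ((1.0 - alpha) if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(A, [(1.0 - alpha) * v for v in y])


# Hypothetical normalized Laplacian for a three-document chain (0-1-2).
s = 1.0 / math.sqrt(2.0)
laplacian = [[1.0, -s, 0.0], [-s, 1.0, -s], [0.0, -s, 1.0]]
y = [1.0, 0.0, 0.0]
alpha = 0.5
f_star = regularize(laplacian, y, alpha)
```

With $\alpha = 0$ the system reduces to $\mathbf{f}^* = \mathbf{y}$, so the initial scores are returned unchanged. As $\alpha$ approaches 1 the smoothness term dominates.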

Our final score regularization algorithm is presented in Figure 5.4. Note that the affinity
matrix computed in Step 1 is used for adding elements to $\mathbf{W}$ in Step 2 and does not define $\mathbf{W}$
itself unless $k = \tilde{n}$. We depict a graph with unregularized and regularized scores in Figure
5.5.

Local Score Regularization

1. compute $\tilde{n} \times \tilde{n}$ affinity matrix, $\mathbf{A}$
2. add the $k$ nearest neighbors for each document to $\mathbf{W}$
3. compute Laplacian, $\Delta$
4. $\mathbf{f}^* = (1 - \alpha) (\alpha \Delta + (1 - \alpha) \mathbf{I})^{-1} \mathbf{y}$

$\tilde{n}$         number of document scores to regularize
$\mathbf{y}$        top $\tilde{n}$ initial retrieval scores
$k$                 number of neighbors to consider
$\alpha$            parameter favoring inter-document consistency
$\mathbf{f}^*$      regularized scores

Figure 5.4. Local Score Regularization Algorithm. Inputs are $\tilde{n}$, $\mathbf{y}$, $k$, and $\alpha$. The output
is a length $\tilde{n}$ vector of regularized scores, $\mathbf{f}^*$.
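
Step 2 of the algorithm, building $\mathbf{W}$ from the affinity matrix, can be sketched as follows. The text does not say how the neighbor relation is symmetrized, so this sketch assumes an edge is kept whenever either document lists the other among its $k$ nearest neighbors; the function name `knn_graph` and the affinity values are hypothetical.

```python
def knn_graph(affinity, k):
    """Build a symmetric k-nearest-neighbor graph with no self-loops."""
    n = len(affinity)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank every other document by its affinity to document i.
        candidates = [j for j in range(n) if j != i]
        nearest = sorted(candidates, key=lambda j: affinity[i][j], reverse=True)[:k]
        for j in nearest:
            W[i][j] = affinity[i][j]
            W[j][i] = affinity[i][j]  # assumed symmetrization (union of neighbor lists)
    return W


# Hypothetical symmetric affinities among four documents: two tight pairs.
affinity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
W = knn_graph(affinity, k=1)
```

With `k=1` only the two strong edges (0-1 and 2-3) survive. With `k` equal to the number of other documents, every off-diagonal affinity is kept and $\mathbf{W}$ coincides with the affinity matrix apart from the diagonal.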


5.3  Experiments 

We  conducted  two  sets  of  experiments.  The  first  set  of  experiments  studies  the  behavior 
of  regularization  in  detail  for  four  retrieval  algorithms:  one  vector  space  model  algorithm 
(Okapi),  two  language  modeling  algorithms  (query  likelihood,  relevance  models),  and  one 
feature-based algorithm (Markov random field); we will abbreviate these okapi, QL, RM,
and MRF. We present detailed results demonstrating improvements and parameter stability.
We will refer to these as the detailed experiments. The second set of experiments applies
regularization to automatic runs submitted to the TREC ad hoc retrieval track. These
experiments demonstrate the generalizability of regularization. A detailed description of these
initial  retrievals  can  be  found  in  Appendix  A. 

The regularization parameters consist of the degree of regularization, $\alpha$, and the number
of neighbors used for defining the graph, $k$. In the detailed experiments, we used cosine
similarity for the okapi baseline and the diffusion kernel for the QL, RM, and MRF baselines.
When our similarity measure is the diffusion kernel, we also train the bandwidth
parameter, $t$. The parameter values considered are,

parameter   range
$\alpha$    0.1 to 0.9 in steps of 0.1
$k$         {5, 10, 25}
$t$         {0.1, 0.25, 0.50, 0.75, 0.90}

We describe the data for our generalizability experiments in Appendix A. Due to the
large number of runs, we fix $k = 25$ and sweep $\alpha$ between 0.05 and 0.95 with a step size
of 0.05. The optimal $\alpha$ is selected using 10-fold cross validation optimizing mean average
precision.

Figure 5.5. Unregularized (a) and regularized (b) scores for the query “U.S. Restaurants in Foreign Lands”.

We normalized all scores to zero mean and unit variance for empirical and theoretical
reasons [Belkin et al., 2004; Montague and Aslam, 2001].
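
The score normalization and the cross-validated choice of $\alpha$ described above can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the per-query average precision table `ap` is synthetic, and the fold assignment (a seeded random permutation of queries) is one reasonable choice rather than the procedure actually used.

```python
import numpy as np


def zscore(scores):
    """Normalize one query's scores to zero mean and unit variance."""
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean()
    std = scores.std()
    return centered / std if std > 0 else centered


def cross_validated_alpha(ap, alphas, folds=10, seed=0):
    """ap[q, a] is the average precision of query q under alphas[a].

    For each fold, pick the alpha maximizing mean average precision on the
    training queries and record the held-out queries' AP under that alpha.
    """
    ap = np.asarray(ap, dtype=float)
    order = np.random.default_rng(seed).permutation(ap.shape[0])
    held_out_ap = np.empty(ap.shape[0])
    chosen = []
    for test_idx in np.array_split(order, folds):
        train_mask = np.ones(ap.shape[0], dtype=bool)
        train_mask[test_idx] = False
        best = int(np.argmax(ap[train_mask].mean(axis=0)))
        chosen.append(alphas[best])
        held_out_ap[test_idx] = ap[test_idx, best]
    return held_out_ap.mean(), chosen


# Synthetic example: 50 hypothetical queries, alpha swept from 0.05 to 0.95.
alphas = np.linspace(0.05, 0.95, 19)
rng = np.random.default_rng(0)
ap = 0.1 * rng.random((50, alphas.size))
ap[:, 3] += 0.5  # make alphas[3] the best setting for every synthetic query
mean_ap, chosen = cross_validated_alpha(ap, alphas)
z = zscore([1.0, 2.0, 3.0])
```

Reporting the held-out mean keeps the selected $\alpha$ from being evaluated on the queries used to choose it.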

5.4  Results 

5.4.1  Detailed  Experiments 

Our first set of experiments explored the impact of score regularization on four
state-of-the-art baselines. We present precision-recall curves for regularizing these scores in Figures
5.6 (trec12) and 5.7 (robust). The detailed tables of these results showing regularization for
different values of $n$ can be found in Appendix D.

We notice that mean average precision improves for all baseline algorithms. These
improvements are all significant with p < 0.05 using the Wilcoxon test. The gains for query
likelihood and the Markov random field model are all approximately 10% relative to the
baseline. Regularizing okapi scores results in a smaller relative gain (6-9%). We find the
weakest improvements (3%) with our strongest baseline, relevance models. Nevertheless, even
this run sees significant gains in mean average precision.

The precision-recall graphs in Figures 5.6 and 5.7 can be used to detect the location of
improvements with respect to the ranked list. All of the improvements from regularization
affect the middle-recall parts of ranked lists, resulting in a ballooning out of the
precision-recall curve between recall points of 0.20 and 0.60. At low recall points, we only see slight
degradations in performance if any. The improvements resulting from regularization,
therefore, do not indicate that we are trading high precision for improvements in mean average
precision; the gains are consistent across all recall points.

In  order  to  examine  the  performance  changes  contributing  to  the  mean  changes,  we  plot 
the  distribution  of  relative  changes  in  Figure  5.8.  The  red  bars  represent  improvements; 
blue  bars  represent  degradations.  In  all  cases,  we  expect  there  to  be  more  improvements 
than  degradations.  We  are  interested  in  measuring  the  robustness  of  a  net  improvement 
by  inspecting  the  distribution  of  per-query  improvements.  A  net  improvement  is  unstable 
if  we  see  many  queries  substantially  improved  and  many  queries  substantially  degraded.  A 
net improvement is stable if we see improvements in general and only slight degradations.
Across  all  systems  and  collections,  the  improvements  are  dominated  by  slight  improvements 
between  0  and  25%.  On  the  other  hand,  the  majority  of  degradations  are  also  small.  With 
our  strongest  baseline,  these  slight  changes  in  performance  account  for  the  vast  majority  of 
changes in performance, implying that regularization can be applied without fear of
significantly hurting some queries. Our other baselines demonstrate many improvements above
25%  while  avoiding  a  large  number  of  substantial  degradations. 

Next, we examine the impact of our choice of Laplacian. In Section 5.2.1, we described
three alternative definitions of the graph Laplacian. Because our top $n$ documents were
likely to be a non-uniform sample across topics, we adopted the approximate
Laplace-Beltrami operator which addresses sampling violations. In order to evaluate this choice
of Laplacian, we compared the absolute improvements in performance for all three Laplacians.
Our hypothesis was that the approximate Laplace-Beltrami operator, because it is
designed to be robust to sampling violations, would result in strong improvements in
performance. The results of this comparison are presented in Figure 5.9. In all cases the simple
combinatorial Laplacian clearly underperforms other Laplacians. Recall from Equation
5.2 that, although it weighs the comparisons in scores between documents using $W_{ij}$, the
combinatorial Laplacian does not normalize this weight by the node degrees (i.e., $D_{ii}$). Both
the normalized Laplacian (Equation 5.6) and the approximate Laplace-Beltrami operator
(Equation 5.7) normalize this weight. However, there do not appear to be significant,
consistent advantages to using the approximate Laplace-Beltrami operator over the normalized
Laplacian. This result suggests that, while degree normalization is important, our data may
not exhibit the characteristics needed to benefit from the approximate
Laplace-Beltrami operator.

Figure 5.6. Precision-recall curves for regularized trec12 scores. Mean average precision
shown in parentheses.

Figure 5.7. Precision-recall curves for regularized robust scores. Mean average precision
shown in parentheses.

Figure 5.8. Distribution of relative improvements and degradations in performance for
detailed experiments.

Our first set of experiments, described in Figures 5.6 and 5.7, demonstrated
improvements across all four baseline algorithms. The $\alpha$ parameter controls the degree of
regularization and therefore the amount of local consistency introduced. In Figure 5.10, we plot
the effect of regularization as a function of this parameter. We see that, in some cases,
regularization actually hurts performance at high values of $\alpha$. This results from the fact that,
as $\alpha$ approaches 1, the document scores grow increasingly similar, at some point becoming
constant. This effect is more noticeable with our pseudo-relevance baseline, indicating that
perhaps the scores may already be locally consistent. We will return to this observation in
the next chapter. Other baselines, however, see improvements from higher ranges of $\alpha$.
Appealing to the diffusion metaphor, a higher $\alpha$ means that documents are gaining more information
from their neighbors than from their original scores.

One of the core assumptions behind our technique is the presence of a lower-dimensional
structure recovered by the graph. The number of neighbors, $k$, represents how much we
trust the ambient affinity measure for this set of documents. If performance improves as
we consider more neighbors, graph-based methods are less justified. In Figure 5.11, we
evaluate performance as a function of the number of neighbors. Across all algorithms and
all distance measures, we notice a degradation in performance as more neighbors are
considered. This occurs even in the presence of a soft nearest neighbor measure such as the
diffusion kernel. This behavior might result from several causes. For example, our similarity
measure may be inaccurate for larger dissimilarities, resulting in an inability to accurately
order several dissimilar documents. Alternatively, we may be experiencing a non-uniform
distribution of documents in the ambient space (recall our discussion of this in Section 3.3).

5.4.2  Generalizability  Experiments 

Our detailed experiments demonstrated the improvement of performance achieved by
regularizing four baselines. We were also interested in the performance over a wide
variety of initial retrieval algorithms. We present results for regularizing the TREC
submissions in Figures 5.12 and 5.13. Although regularization on average produces improvements,
there are a handful of runs for which performance is degraded. Inspecting the baselines
with the more dramatic degradations (trec8), we noticed that the scores for these runs had
odd distributions, greatly affecting our normalization procedures. Other reductions in
performance may be the result of an unoptimized $k$ parameter. Improvements are consistent
across collections and languages.

Figure 5.9. Performance improvement as a function of Laplacian type. For each Laplacian
described in Section 5.2.1, we maximized mean average precision using 10-fold
cross-validation (left: combinatorial Laplacian, center: normalized Laplacian, right: approximate
Laplace-Beltrami). The different Laplacians represent different degree normalization
techniques.

Figure 5.10. Performance as a function of amount of regularization. For each value of $\alpha$,
we selected the values for $k$ and $t$ maximizing mean average precision. A higher value for
$\alpha$ results in more aggressive regularization. A low value of $\alpha$ recovers the original scores.

Figure 5.11. Performance as a function of number of neighbors. For each value of $k$, we
selected the values for $\alpha$ and $t$ maximizing mean average precision. If we trust the distance
metric, we would expect the performance to increase as we increase the number of
neighbors.

Figure 5.12. Improvement in mean average precision for TREC query-based retrieval
tracks. Each point represents a competing run. The horizontal axis indicates the original
mean average precision for this run. The vertical axis indicates the mean average
precision of the regularized run. Red points indicate an improvement; blue points indicate
degradations.

Figure 5.13. Improvement in mean average precision for TREC query-based retrieval
tracks. Each point represents a competing run. The horizontal axis indicates the original
mean average precision for this run. The vertical axis indicates the mean average
precision of the regularized run. Red points indicate an improvement; blue points indicate
degradations.

5.5 Discussion


We introduced score regularization in order to test Hypothesis 5.1. The results from
our experiments indicate that increasing the local consistency of scores does improve
performance.

We have also developed a generic post-processing procedure for improving the
performance of arbitrary score functions. The results in Figures 5.12 and 5.13 provide evidence
that existing retrieval algorithms can benefit from regularization.

The results in Figures 5.9 and 5.11 suggest that the construction of the diffusion operator
is sometimes important for regularization efficacy. Since there are a variety of methods for
constructing affinity and diffusion geometries, we believe that this should inspire a formal
investigation and comparison of various approaches.

The results in Figure 5.11 also allow us to test the manifold properties of the initial
retrieval. The relatively small range of change of the curves for the relevance model run
implies that the ambient measure behaves well for the documents in this retrieval.
Poorer-performing algorithms, by definition, have a mix of relevant and non-relevant documents.
Including more edges in the graph by increasing the value of $k$ will be more likely to
relate relevant and non-relevant documents. From the perspective of graph-based methods,
the initial retrieval for poorer-performing algorithms should be aggressively sparsified with
low values for $k$. On the other hand, better-performing algorithms may benefit less from a
graph-based representation, allowing us to let $k$ grow. From a geometric perspective,
documents from poorer-performing algorithms are retrieved from regions of the embedding
space so disparate that topical relationships are poorly approximated by the ambient
affinity. Documents from better-performing queries all exist in a region of the embedding space
where affinity is well approximated by the ambient affinity.

We have noted that the aggressiveness of regularization ($\alpha$) is related to the performance
of the initial retrieval. Figure 5.10 demonstrates that smaller values for $\alpha$ are more suitable
for better-performing algorithms. This indicates that the use of techniques from precision
prediction may help to automatically adjust the $\alpha$ parameter [Carmel et al., 2006;
Cronen-Townsend et al., 2002; Yom-Tov et al., 2005].

We mentioned in Chapter 3 that the cluster hypothesis motivates retrieval methods such
as cluster-based retrieval. It is worth comparing the effectiveness improvements resulting
from regularization to those resulting from other clustering methods. Therefore, we
regularized the scores of a strong baseline for a collection used in previous cluster-based retrieval
work [Liu and Croft, 2004, 2006]. Results are presented in Table 5.1. The small set of
results in this table indicates that the improvements achieved by regularization are as strong
as those achieved by the various cluster-based retrieval methods. We speculate that this is
because the clustering done by these previous methods operates at a larger scale than the local
methods implicit in regularization. Clustering into large clusters introduces
relationships between otherwise dissimilar documents and smooths into more general topics
than the query requires.

Finally, we should address the question of efficiency. There are two points of
computational overhead in our algorithm. First, the construction of the $n \times n$ affinity matrix
requires $O(n^2)$ comparisons. For $n = 1000$, this took approximately 8 seconds. Although
most of our experiments use $n = 1000$, we can inspect the improvement in performance
as a function of the number of documents being regularized, $n$. In Figure 5.14, we notice
that performance improves and then plateaus. These results show that $n$ need not be as
large as 1000 to achieve improvements. For example, for $n = 100$, this computation takes
less than 0.5 seconds. We should also point out that we can compute the entire collection
affinity matrix and store it prior to any retrieval. In Figure 5.11, we showed that only very
few neighbors were required to perform well, implying that the storage cost can be $O(nk)$.

       QL      single link  average link  Ward    CBDM    LSR
AP     0.2179  0.2153       0.2161        0.2160  0.2326  0.2562
WSJ    0.2958  0.2911       0.2902        0.2963  0.3006  0.3141

Table 5.1. Comparison of cluster-based retrieval methods and regularization. Mean
average precision is presented for the Associated Press and Wall Street Journal collections [Liu
and Croft, 2004]. The columns labeled “single link”, “average link”, and “Ward” refer
to agglomerative clustering methods. The column labeled “CBDM” refers to a
cluster-based document expansion method. The column labeled “LSR” refers to local score
regularization. All non-regularization performance values are copied
directly from previous publications. Regularization used a query likelihood baseline which
was comparable in terms of performance to the baseline used in the referenced publications.

The second point of computational overhead is in the inversion of the matrix in Equation
5.11. We show running time as a function of $n$ in Figure 5.15. Note that our experiments,
although very expensive when $n = 1000$, can be made significantly cheaper by
reducing $n$ to 500 which, according to Figure 5.14, would still boost baseline performance.
We could also address the inversion by using the iterative solution. In related work, using a
pre-computed similarity matrix and an iterative solution allowed real-time pseudo-relevance
feedback [Lavrenko and Allan, 2006].

5.6  Conclusions 

We have provided substantial evidence that the introduction of local consistency into a
set of retrieval scores improves performance. Our results do not suggest a monotonic
relationship since the benefits fell after some amount of regularization. Although we began
this chapter intending to test a hypothesis about the relationship between local consistency
and performance, as a byproduct, we have developed a black box method for improving the
performance of arbitrary retrieval algorithms. Having demonstrated the
effectiveness of regularization, in the subsequent chapters, we will study regularization in more
technical detail.

Figure 5.14. Performance as a function of number of documents used for regularization.
For each value of $n$, we selected the values for $\alpha$, $k$ and $t$ maximizing mean average precision.
A higher value for $n$ considers more documents in the regularization.

Figure 5.15. Running time as a function of number of documents used for regularization.
For each value of $n$, we regularized the scores given a pre-computed affinity matrix.

CHAPTER 6


CONNECTIONS BETWEEN REGULARIZATION AND OTHER RETRIEVAL METHODS


Several classic retrieval methods can be posed as instances of score regularization. We
will be focusing on the relationship between these methods and a single iteration of score
regularization (Equation 5.10). In previous chapters, we considered only the top $\tilde{n} \ll n$
documents from some initial retrieval. In this section, we may at times consider every
document in the collection (i.e., $\tilde{n} = n$).

In this chapter, we will demonstrate that regularization is a common criterion for many
successful retrieval methods. This approach is similar to Fang et al.’s study of heuristics shared
across formal models of information retrieval [Fang et al., 2004, p. 49; emphasis added].

Despite the progress in the development of formal retrieval models, good
empirical performance rarely comes directly from a theoretically well-motivated
model; rather, heuristic modification of a model is often necessary in order to
achieve optimal retrieval performance. Indeed, many empirical studies show that
good retrieval performance is closely related to the use of various retrieval heuristics,
especially TF-IDF weighting and document length normalization.

We  will  show  that  regularization,  like  tf.idf  and  document  length  normalization,  is  a  property 
found  in  many  successful  retrieval  methods. 

For  each  of  the  methods  in  this  section,  we  will  be  asking  ourselves  the  following  question: 
can  the  final  retrieval  scores  be  computed  as  a  function  of  the  initial  retrieval  scores  and  a 
similarity-based  adjacency  matrix?  If  the  answer  to  this  question  is  “yes”,  then  we  can  state 
that  this  method  is  an  instance  of  score  regularization. 

6.1  Vector  Space  Model  Retrieval 

In  Section  2.3.1,  we  represented  each  document  as  an  L2  normalized,  length-m  vector, 
$d$. A query can also be represented by a normalized, length-$m$ vector, $q$. A document’s
score is the inner product between its vector and the query vector (i.e., $y_i = \langle d_i, q \rangle$).

Claim 6.1. Pseudo-relevance feedback in the vector space model using Rocchio is a form of regularization.

Figure 6.1. Hard weighting function for pseudo-relevance feedback. The horizontal axis
represents the documents in decreasing order of $y$. The function $\sigma(y)$ acts as a filter for
pseudo-relevant documents. It sets the score of each of the top $r$ documents to 1.

Proof. First, we note that the similarity between a document and the new query, $\tilde{q}$, can be
written as the combination of the original document score and the sum of similarities to the
pseudo-relevant set, $R$,

$$\begin{aligned}
\langle d_i, \tilde{q} \rangle &= \langle d_i, q \rangle + \frac{\alpha}{r} \Big\langle d_i, \sum_{j \in R} d_j \Big\rangle \\
&= \langle d_i, q \rangle + \frac{\alpha}{r} \sum_{j \in R} \langle d_i, d_j \rangle
\end{aligned} \tag{6.1}$$

Notice here that the first term in the sum is $y_i$ and the second term represents
the similarity to the pseudo-relevant documents, $\sum_{j \in R} A_{ij}$. We can rewrite Equation 6.1
in terms of matrix operators to compute the new scores for all documents in the collection.
This computation is a function of the initial scores and the inner product affinity matrix,

$$f = y + \frac{\alpha}{\|\sigma(y)\|_1} A \sigma(y) \tag{6.2}$$

where $\sigma(y) : \mathbb{R}^n \to \mathbb{R}^n$ is defined as,

$$\sigma(y)_i = \begin{cases} 1 & \text{if } i \in R \\ 0 & \text{otherwise} \end{cases} \tag{6.3}$$

We compare $\sigma(y)$ to $y$ in Figure 6.1. The $\sigma$ function maps high-ranked documents to
pseudo-scores of 1. This behavior replicates the judgment of documents as relevant. From
our perspective of score functions, we see that $\sigma$ acts as a hard filter on the signal $y$. This
demonstrates that Rocchio is an instance of score regularization.

□ 
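
The identity behind Claim 6.1 can be checked numerically. The sketch below is only an illustration: the random collection, the feedback weight `alpha`, and the variable names are hypothetical, and the Rocchio update is assumed to add the weighted mean of the top $r$ document vectors to the query with no negative feedback.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, alpha = 8, 5, 3, 0.5          # hypothetical sizes and feedback weight

# L2-normalized document vectors (rows of D) and query vector.
D = rng.random((n, m))
D /= np.linalg.norm(D, axis=1, keepdims=True)
q = rng.random(m)
q /= np.linalg.norm(q)

y = D @ q                              # initial scores, y_i = <d_i, q>
A = D @ D.T                            # inner product affinity matrix
R = np.argsort(-y)[:r]                 # pseudo-relevant set: top r documents
sigma = np.zeros(n)
sigma[R] = 1.0                         # hard filter of Equation 6.3

# Rocchio: expand the query, then rescore every document.
q_expanded = q + (alpha / r) * D[R].sum(axis=0)
f_rocchio = D @ q_expanded

# Regularization form of Equation 6.2: only y and A are needed.
f_regularized = y + (alpha / np.abs(sigma).sum()) * (A @ sigma)

print(np.allclose(f_rocchio, f_regularized))
```

The second computation never touches the term space, which is the sense in which the feedback scores depend only on the initial scores and the affinity matrix.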
Whereas pseudo-relevance feedback incorporates into a query terms from $r$
pseudo-relevant documents, document expansion incorporates into a document the terms from its $k$
most similar neighbors [Singhal and Pereira, 1999]. The modified document, $\tilde{d}_i$, is defined
as,

$$\tilde{d}_i = \alpha_D d_i + \frac{1}{k} \sum_{j \in N(i)} d_j \tag{6.4}$$

where $\alpha_D$ is the weight placed on the original document vector. $N(i)$ is the set of $k$
documents most similar to document $i$.

Claim  6.2.  Document  expansion  in  the  vector  space  model  is  a  form  of  regularization. 

Proof. Define the binary matrix $W$ so that each row $i$ contains $k$ non-zero entries, one for each
of the indices in $N(i)$. We can expand all documents in the collection,

$$\tilde{C} = \alpha_D C + \frac{1}{k} W C \tag{6.5}$$

Given a query vector, we can score the entire collection,

$$\begin{aligned}
f &= \tilde{C} q \\
&= \left( \alpha_D C + \frac{1}{k} W C \right) q \\
&= \alpha_D C q + \frac{1}{k} W C q \\
&= \alpha_D y + \frac{1}{k} W y
\end{aligned} \tag{6.6}$$

The point here is that the score of an expanded document ($f_i$) is the linear combination of
the original score ($y_i$) and the scores of its $k$ neighbors ($\frac{1}{k} \sum_{j \in N(i)} y_j$). This demonstrates that
document expansion is a form of regularization. □
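
A small numerical check of the document expansion argument, under stated assumptions: the collection is random, a document is excluded from its own neighborhood, and the names `alpha_d`, `neighbors`, and `C_expanded` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k, alpha_d = 8, 5, 3, 0.7        # hypothetical sizes and weight

# Collection matrix C (documents as L2-normalized rows) and a query vector.
C = rng.random((n, m))
C /= np.linalg.norm(C, axis=1, keepdims=True)
q = rng.random(m)
q /= np.linalg.norm(q)
y = C @ q                              # initial scores

# Binary neighbor matrix W: row i marks the k documents most similar to i.
similarity = C @ C.T
np.fill_diagonal(similarity, -np.inf)  # assumption: exclude self-neighbors
neighbors = np.argsort(-similarity, axis=1)[:, :k]
W = np.zeros((n, n))
W[np.repeat(np.arange(n), k), neighbors.ravel()] = 1.0

# Expand the documents in term space, then score them.
C_expanded = alpha_d * C + (W @ C) / k
f_expanded = C_expanded @ q

# Same scores computed from y and W alone.
f_regularized = alpha_d * y + (W @ y) / k

print(np.allclose(f_expanded, f_regularized))
```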

We now turn to the dimensionality reduction school of cluster-based retrieval algorithms.
In the previous proof, we expanded the entire collection using the matrix $W$. Clustering
techniques such as Latent Semantic Indexing (LSI) can also be used to expand documents
[Deerwester et al., 1990]. LSI-style techniques use two auxiliary matrices: $V$ is the $k \times n$
matrix embedding documents in the $k$-dimensional space and $U$ is the $m \times k$ matrix of representations
of the dimensions in the ambient space. Oftentimes, queries are processed by projecting
them into the $k$-dimensional space (i.e., $\tilde{q} = U^T q$). We use an equivalent formula where we
expand documents by their LSI-based dimensions,

$$\tilde{C} = \lambda C + (1 - \lambda) V^T U^T$$

We then score a document by its cluster-expanded representation.¹

¹In practice, the document representations are only based on the cluster information (i.e., $\lambda = 0$). Our
ranking function generalizes classic cluster-based retrieval functions.

Claim 6.3. Cluster-based retrieval in the vector space model is a form of regularization.

Proof. Our proof is similar to the proof for document expansion.

$$\begin{aligned}
f &= \tilde{C} q \\
&= (\lambda C + (1 - \lambda) V^T U^T) q \\
&= \lambda y + (1 - \lambda) V^T [U^T q] \\
&= \lambda y + (1 - \lambda) V^T y_c
\end{aligned} \tag{6.7}$$

Because the dimensions (clusters) are representable in the ambient space, we can score them
as we do documents; here, we use the $k \times 1$ vector, $y_c$, to represent these scores. Essentially,
the document scores are interpolated with the scores of the clusters. □
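
The interpolation in Equation 6.7 can be illustrated with a truncated singular value decomposition standing in for LSI. This is a sketch under assumptions: the collection is random, $V$ and $U$ are taken from the rank-$k$ SVD of $C$ (with the singular values folded into $V$), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k, lam = 8, 6, 2, 0.4            # hypothetical sizes and interpolation weight

C = rng.random((n, m))                 # collection matrix, documents as rows
q = rng.random(m)                      # query vector
y = C @ q                              # initial document scores

# Rank-k factors from the SVD of C (an assumed stand-in for LSI).
P, s, Qt = np.linalg.svd(C, full_matrices=False)
V = (P[:, :k] * s[:k]).T               # k x n: documents in the latent space
U = Qt[:k].T                           # m x k: latent dimensions in term space

# Expand documents by their latent dimensions, then score them.
C_expanded = lam * C + (1 - lam) * (V.T @ U.T)
f_expanded = C_expanded @ q

# Score the k latent dimensions like documents, then interpolate.
y_c = U.T @ q
f_regularized = lam * y + (1 - lam) * (V.T @ y_c)

print(np.allclose(f_expanded, f_regularized))
```

Setting `lam = 0` corresponds to scoring on the cluster information alone.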

6.2  Language  Modeling  Retrieval 

In  the  previous  section,  we  demonstrated  the  equivalence  between  several  vector  space 
model  techniques  and  regularization.  In  the  context  of  retrieval  using  language  models,  we 
can  only  show  a  reduction  for  pseudo-relevance  feedback.  This  result,  though,  is  significant 
since  it  provides  an  alternative  explanation  for  the  success  of  this  method. 

Claim  6.4.  Relevance  models  are  a  form  of  regularization. 

Proof. Our proof is based on a similar derivation used in the context of efficient
pseudo-relevance feedback [Lavrenko and Allan, 2006]. Recall that we use $(\log C) q$ to rank the
collection. By rearranging some terms, we can view relevance models from a different
perspective,

$$\begin{aligned}
f &= (\log C) \tilde{q} \\
&= (\log C) \left( \lambda q + \frac{1 - \lambda}{\|y\|_1} C^T y \right) \\
&= \lambda (\log C) q + \frac{1 - \lambda}{\|y\|_1} (\log C) C^T y \\
&= \lambda y + \frac{1 - \lambda}{\|y\|_1} A y
\end{aligned} \tag{6.8}$$

where $A$ is an $n \times n$ affinity matrix based on inter-document cross-entropy. Since the
relevance model scores can be computed as a function of inter-document affinity and the initial
scores, this is an instance of score regularization. In fact, iterating the process in Equation
2.13 has been shown to improve performance of relevance models and provides an
argument for considering the closed form solution in Equation 5.11 [Kurland et al., 2005].² □
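
The rearrangement in the proof of Claim 6.4 is purely algebraic and can be checked on synthetic data. The sketch below makes several assumptions that should be read as such: the expanded query is taken to be $\lambda q + \frac{1 - \lambda}{\|y\|_1} C^T y$, the language models are random smoothed distributions, and $y$ is used exactly as the proof uses it, without converting log scores into likelihood weights as a practical relevance model would.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 6, 10, 0.5                 # hypothetical sizes and interpolation weight

# Smoothed document language models (rows sum to 1, strictly positive).
C = rng.random((n, m)) + 0.01
C /= C.sum(axis=1, keepdims=True)
q = rng.random(m)
q /= q.sum()                           # query model

log_C = np.log(C)
y = log_C @ q                          # initial scores, (log C) q
A = log_C @ C.T                        # A[i, j] = sum_w C[j, w] * log C[i, w]
norm = np.abs(y).sum()                 # L1 norm of the initial scores

# Feedback view: build an expanded query, then rescore the collection.
q_expanded = lam * q + (1 - lam) / norm * (C.T @ y)
f_feedback = log_C @ q_expanded

# Regularization view: only the initial scores and the affinity matrix.
f_regularized = lam * y + (1 - lam) / norm * (A @ y)

print(np.allclose(f_feedback, f_regularized))
```

The matrix `A` here is asymmetric, which is why the text defers to the iterative solution rather than the closed form for this affinity.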

²In Section 2.3.2, we adopted the symmetric diffusion kernel to compare distributions. The cross-entropy
measure here is asymmetric and therefore cannot be used in our closed form solution. Nevertheless, our
iterative solution is not constrained by the symmetry requirement. Furthermore, theoretical results for Laplacians
of directed graphs exist and can be applied in our framework [Chung, 2004; Zhou et al., 2005].

Unfortunately, we cannot reduce document expansion in the language modeling
framework to regularization. Document expansion in language modeling refers to adjusting the
document language models $P(w|\theta_d)$ given information about neighboring documents [Tao
et al., 2006]. In this situation, the score function can be written as,

$$f = \log(\lambda C + (1 - \lambda) A C)\, q \tag{6.9}$$

Because the logarithm effectively decouples the document expansion from the document
scoring, the approach used in the vector space model proof cannot be used here.
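
To see concretely why the vector space argument does not carry over, one can compare the score in Equation 6.9 with the linear combination $\lambda y + (1 - \lambda) A y$ that the earlier proofs would produce. The sketch below assumes a row-stochastic affinity matrix $A$ and random smoothed language models; these choices, and all names, are hypothetical. Under these assumptions the concavity of the logarithm (Jensen's inequality) makes the expansion score at least as large as the linear form, and the two do not coincide in general.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, lam = 6, 10, 0.6                 # hypothetical sizes and interpolation weight

# Smoothed document language models and a query model.
C = rng.random((n, m)) + 0.01
C /= C.sum(axis=1, keepdims=True)
q = rng.random(m)
q /= q.sum()

# Assumed row-stochastic affinity, so expanded rows remain distributions.
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)

y = np.log(C) @ q                                      # initial scores
f_expansion = np.log(lam * C + (1 - lam) * (A @ C)) @ q  # Equation 6.9
f_linear = lam * y + (1 - lam) * (A @ y)               # form the VSM proof would need

gap = f_expansion - f_linear           # non-negative by Jensen's inequality
print(gap)
```

This shows only that moving the interpolation outside the logarithm changes the scores; it does not rule out other reductions.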

The language modeling approach to cluster-based retrieval is conceptually very similar
to document expansion [Liu and Croft, 2004; Wei and Croft, 2006]. The distribution
$P(z|D)$ represents the distribution of subtopics or aspects in a document; we also have
$P(w|z)$ representing language models for each of our subtopics. When we interpolate these
models with the maximum likelihood document models, we get a score function similar to
Equation 6.7,

$$f = \log(\lambda C + (1 - \lambda) V^T U^T)\, q \tag{6.10}$$

where $V$ is the $k \times n$ distribution $P(z|D)$ and $U$ is the $m \times k$ distribution $P(w|z)$. Like
document expansion scores, the logarithm prevents converting cluster-based expansion into
a regularization form.

It is worth devoting some time to Kurland and Lee’s cluster-based retrieval model
[Kurland and Lee, 2004]. The model is used to perform retrieval in three steps. First, each
document is scored according to an expanded document model. Second, an $n \times n$
matrix comparing unexpanded and expanded models is constructed. Finally, each document
is scored by the linear interpolation of its original (unexpanded) score and the scores of
the nearest expanded documents. To this extent, the model combines regularization and
document-expansion retrieval in a language modeling framework. Unfortunately, there do
not appear to be experiments demonstrating the effectiveness of each of these steps. Is this
model an instance of score regularization? Yes and no. The second interpolation process
clearly is an iteration of score regularization. The first score is language model document
expansion and therefore not regularization.

Recall that the vector space model allowed fluid mathematical movement from query
expansion to regularization to document expansion and finally to cluster-based retrieval.
This is not the case for language modeling. Language models have a set of rank-equivalent
score functions; we adopt cross entropy in our work. The problem, however, is that measures
such as the Kullback-Leibler divergence, cross entropy, and query likelihood are all
non-symmetric and therefore not valid inner products. This disrupts the comparison to the
vector space model derivations because a smooth transition from regularization (Equation
6.8) to document expansion is impossible.

6.3  Feature-Based  Retrieval 

Pseudo-relevance feedback in the context of feature-based retrieval can also be reduced
to regularization. In the Markov random field model of information retrieval,
pseudo-relevance feedback is referred to as latent concept expansion (LCE) and has very close
theoretical connections to relevance models [Metzler and Croft, 2007]. In simple terms, LCE works by
performing expansion using expressions from an initial MRF-based retrieval (Section 2.2.3).
More formally, given $P_{G,\Lambda}(D|Q)$, expansion expressions are weighted according to,

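$$\begin{aligned}
P_{H,\Lambda}(e|Q) &= \sum_{D} \exp\left(F_{DQ}(D,Q) + \log\frac{\left((1-\alpha)\frac{tf_{e,D}}{|D|} + \alpha\frac{cf_e}{|C|}\right)^{\lambda_D}}{\left(\frac{cf_e}{|C|}\right)^{\lambda_C}}\right) \\
&= \sum_{D} \exp\left(F_{DQ}(D,Q)\right)\frac{P(e|D)^{\lambda_D}}{P(e|C)^{\lambda_C}} \\
&= \sum_{D} P_{G,\Lambda}(D|Q)\frac{P(e|D)^{\lambda_D}}{P(e|C)^{\lambda_C}}
\end{aligned}$$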

where each expression is weighted by its collection frequency in a manner originally
proposed by Li [Li, 2006]. Let $Z_Q = \sum_{e \in \mathcal{E}} P_{H,\Lambda}(e|Q)$ where $\mathcal{E}$ is the set of features we are
selecting from. Metzler and Croft [Metzler and Croft, 2007] consider two feature sets:
terms and 2-word proximity expressions. In theory, $\mathcal{E}$ should include all expressions. In
practice, only the expressions with the highest $P_{H,\Lambda}(e|Q)$ are considered for $Z_Q$.

These expression weights are used to construct a second, weighted query. The document
scores after a second retrieval are computed using a combined query,

$$P_{H,\Lambda}(D|Q') = \underbrace{\exp\left(F_{DQ'}(D,Q')\right)^{\alpha}}_{\text{expansion}} \times \underbrace{\exp\left(F_{DQ}(D,Q)\right)^{(1-\alpha)}}_{\text{original query}}$$

Claim  6.5.  Latent  concept  expansion  is  a  form  of  regularization. 

Proof. Beginning with the ranking function for the expanded query,

$$\begin{aligned}
P_{H,\Lambda}(D|Q') &= \exp\left(F_{DQ'}(D,Q')\right)^{\alpha} \times \exp\left(F_{DQ}(D,Q)\right)^{(1-\alpha)} \\
&= \exp\left(\sum_{e} P_{H,\Lambda}(e|Q)\log P(e|D)\right)^{\alpha} \times \exp\left(\log P_{G,\Lambda}(D|Q)\right)^{(1-\alpha)}
\end{aligned}$$

If we let y be the length n vector of original document scores such that $y_i = P_{G,\Lambda}(d_i|Q)$,
then we can define the updated scores, f, such that

$$f = \frac{\alpha}{\|y\|_1} A y + (1 - \alpha)\log(y)$$

where A is like an idf-weighted cross entropy defined as

This means that LCE is theoretically equivalent to a single step of iterative regularization
using a concept-based similarity measure. □

Metzler and Croft indicate that expanding using multi-term expressions never improved
retrieval performance above expansion by single terms. This reduction suggests one possible
explanation: the accuracy of inter-document similarity measures is usually not improved
by considering more complicated features. This is consistent with the insignificant gains
bigrams see in classification and link detection tasks [Bekkerman and Allan, 2004; Nallapati
and Allan, 2002].

6.4  Laplacian  Eigenmaps 

Score regularization can be viewed as nonparametric function approximation. An alternative
method of approximation reconstructs y with smooth basis functions. When put in
this perspective, reconstructing the original function, y, using smooth basis functions
indirectly introduces the desired inter-document consistency [Belkin and Niyogi, 2003]. When
Fourier analysis is generalized to the discrete situation of graphs, the eigenvectors of $\Delta$
provide a set of orthonormal basis functions. We can then construct a smooth approximation
of y using these basis functions. In this situation, our solution is,

$$f^* = E(E^T E)^{-1} E^T y \tag{6.11}$$

where E is a matrix consisting of the k eigenvectors of $\Delta$ associated with the smallest k
eigenvalues. These eigenvectors represent the low frequency harmonics on the graph and
therefore result in smooth reconstruction.³
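The reconstruction in Equation 6.11 can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the graph Laplacian is taken to be the symmetric normalized Laplacian of a symmetric affinity matrix, and the function names and toy data are hypothetical.

```python
import numpy as np

def normalized_laplacian(W):
    """Return I - D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def harmonic_reconstruction(W, y, k):
    """Project y onto the k lowest-frequency eigenvectors of the Laplacian."""
    _, eigvecs = np.linalg.eigh(normalized_laplacian(W))  # eigenvalues in ascending order
    E = eigvecs[:, :k]
    # E (E^T E)^{-1} E^T y, written as a linear solve rather than an explicit inverse
    return E @ np.linalg.solve(E.T @ E, E.T @ y)

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2.0            # symmetric, strictly positive toy affinities
y = rng.normal(size=n)         # toy initial scores

f_smooth = harmonic_reconstruction(W, y, k=2)
```

With k equal to the number of documents the basis is complete and the reconstruction returns y unchanged; smaller k discards the high-frequency components of the score function.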

Claim 6.6. Function approximation using harmonic functions of the document graph is a form of
regularization.

Proof. We can view this process from the perspective of cluster-based retrieval. In the vector
space model, Equation 6.11 can be rewritten as,

$$\begin{aligned}
f^* &= E(E^T E)^{-1} E^T C q \\
&= \left[E(E^T E)^{-1}\right]\left[E^T C\right] q \\
&= V^T U^T q
\end{aligned} \tag{6.12}$$

where the k × m matrix $U^T$ represents the basis as linear combinations of document vectors
and the n × k matrix $V^T$ projects documents into the lower dimensional space. In language
model retrieval, Equation 6.11 can be rewritten as,

$$\begin{aligned}
f^* &= E(E^T E)^{-1} E^T \log(C) q \\
&= \left[E(E^T E)^{-1}\right]\left[E^T \log(C)\right] q \\
&= V^T \log(U^T) q
\end{aligned} \tag{6.13}$$

where the k × m matrix $U^T$ represents the eigenfunctions as geometric combinations of
document vectors.

In both situations, new scores are computed as functions of cluster scores and cluster
affinities. Therefore, we claim that basis reconstruction methods are an instance of score
regularization. □

³ We note that although harmonic reconstruction has been successfully used for text classification tasks
[Belkin and Niyogi, 2003], in our experience, the approach does not produce significant improvements for
retrieval. This result follows from the fact that the retrieval score functions are likely to be much more peaked
than classification score functions (cf. Figure 2.1). Szlam et al. note [Szlam et al., 2006, p. 3], "The Fourier
modes $\phi_l$ are global functions, and hence the projection of a function f onto the top eigenvectors of the
diffusion operator is affected by global properties of the space and f, and may destroy important local features
of f."


6.5  Link  Analysis  Algorithms 

Graph representations often suggest the use of discrete metrics such as PageRank to
re-weight initial retrieval scores [Brin and Page, 1998; Cohn and Hofmann, 2000; Kleinberg,
1998; Kurland and Lee, 2005]. These metrics can be thought of as functions from a document
to a real value, $g_W : \mathcal{D} \to \mathbb{R}$. The function is indexed by the weight matrix W because
these metrics are often dependent only on the graph structure. Let g be the length-n vector
of values of $g_W$ for our n documents. We will refer to this vector as the graph structure function.
The values in g are often combined with those in y by linear combination (eg, f = y + g)
or geometric combination (eg, f = y ∘ g).

Many of these methods are instances of the spectral techniques presented in Section 6.4
[Ng et al., 2001]; specifically, PageRank is the special case where only the top eigenvector is
considered (ie, $g = E_1$).

We believe it is very important to ask why the graph represented in W is being used in
retrieval. For regularization, the matrix W by design enforces inter-document score
consistency. For hypertext, the matrix W (by way of g) provides the stationary distribution of the
Markov chain defined by the hypertext graph. This can be a good model of page popularity
in the absence of true user visitation data. When better user visitation information is
available, though, the model provided by g is less useful [Richardson et al., 2006]. When the
graph W is derived from content-based similarities, what does g mean? It is unclear that
content-derived links can be navigational surrogates; the hypothesis has never been tested.
Therefore, applications of graph structure functions to content-based graphs seem weakly
justified. We believe that the incorporation of graph structure through regularization, by
contrast, has a more solid theoretical motivation.

Because  the  structure  information  is  lost  when  computing  g  from  W,  we  cannot  claim 
that  link  analysis  algorithms  are  an  instance  of  regularization. 
6.6 Spreading Activation


When viewed as a diffusion algorithm, our work is also related to the many spreading
activation algorithms [Belew, 1989; Kwok, 1989; Salton and Buckley, 1988; Wilkinson and
Hingston, 1991; Croft et al., 1988] and inference network techniques [Turtle and Croft,
1990; Metzler and Croft, 2004]. In these systems, terms and documents form a bipartite
graph. Usually only direct relationships such as authors or sources allow inter-document
links. These algorithms operate on functions from nodes to real values,
$h : \{\mathcal{D} \cup \mathcal{V}\} \to \mathbb{R}$. The domain of the functions includes both documents and terms. The domain of the
functions in regularization includes only documents. Clearly spreading activation is not a
form of regularization.

However,  since  regularization  is  a  subset  of  spreading  activation  techniques,  why  should 
we  study  it  on  its  own?  First,  it  is  not  clear  that  the  smoothness  objective  is  appropriate  for 
heterogeneous  graphs.  The  assertion  that  the  scores  of  a  term  and  a  document  should  be 
comparable  is  tenuous.  Second,  we  believe  that  our  perspective  is  theoretically  attractive 
because  of  its  ability  to  bring  together  several  pseudo-relevance  feedback  techniques  under 
a  single  framework.  Nevertheless,  the  formal  study  of  heterogeneous  nodes  in  a  manner 
similar  to  score  regularization  is  a  very  interesting  area  of  future  work. 

6.7  Relevance  Propagation 

Hypertext collections have inspired several algorithms for spreading content-based scores
over the web graph [Qin et al., 2005]. These algorithms are equivalent to using a
hyperlink-based affinity matrix and iterative regularization. A similar approach for content-based
affinity has also been proposed [Savoy, 1997]. The foundation of these algorithms is at times
heuristic, though. We believe that our approach places regularization, whether based on
hyperlinks or content affinity, in the context of a mathematical formalism.

6.8  Summary 

In  this  chapter,  we  have  studied  methods  which  directly  and  indirectly  exploit  corpus 
structure.  In  particular,  we  have  examined  these  methods  from  the  perspective  of  score 
regularization.  We  present  a  summary  of  these  results  in  Table  6.1. 

In  the  course  of  our  derivations,  we  have  sought  to  generalize  and  squint  when  necessary 
to  show  similarities  between  algorithms.  In  practice,  the  implementation  of  these  algorithms 
differs  from  what  is  presented  here.  We  believe  these  implementation  differences  explain 
some  performance  differences  and  deserve  more  detailed  analysis. 

A  variety  of  graph  algorithms  exist  which  use  links  based  on  content  and  hyperlinks. 
These  algorithms  often  are  very  subtle  variations  of  each  other  when  analyzed.  We  hope 
that  our  discussion  will  provide  a  basis  for  comparing  graph-based  and  corpus  structure 
algorithms  for  information  retrieval. 

We have restricted our discussion of scoring algorithms to two popular approaches: vector
space retrieval and retrieval of language models. Certainly other models exist and deserve
similar treatment. This chapter should provide a perspective not only on analyzing
query expansion, regularization, and document expansion in other frameworks but also on
developing query expansion, regularization, and document expansion for new frameworks.

Finally, we believe that the results of this chapter indicate that the improvement of local
score consistency explains some of the success of previous approaches. However, we note
that few of these approaches directly incorporate consistency, often opting instead for
application of what amounts to a single iteration of regularization. We further believe that the
direct and formal incorporation of consistency provides a compelling area for future work.

Method                            Score
--------------------------------  ------------------------
Vector Space Model
    Query Expansion               Ay + y
    Document Expansion            Wy + y
    Cluster-based Retrieval       V^T y_c + y
Language Modeling
    Query Expansion               Ay + y
    Document Expansion            log(AC + C) q
    Cluster-based Retrieval       log(V^T U^T + C) q
    Cluster Interpolation         W_c y_c + y
Feature-Based Retrieval
    Query Expansion               Ay + y
Regularization
    Iterative Regularization      Wy + y
    Closed Form Regularization    (αΔ + (1 − α)I)^{-1}
Laplacian Eigenmaps               W_c y_c
PageRank                          E_1 ∘ y

Table 6.1. Comparison of corpus modeling and graph-based algorithms. Model-specific
constants and parameters have been omitted for clarity.
CHAPTER 7


STABILITY  OF  REGULARIZATION 


The  fundamental  data  structure  in  our  regularization  algorithm  is  the  inter-document 
affinity  matrix.  According  to  van  Rijsbergen,  text  affinity  measures  share  heuristics  which 
result  in  very  similar  behavior  [van  Rijsbergen,  1979,  page  24], 

Informally speaking, a measure of association increases as the number or proportion
of shared attribute states increases. Numerous coefficients of association
have been described in the literature, see for example Goodman and Kruskal,
Kuhns, Cormack and Sneath and Sokal. Several authors have pointed out that
the difference in retrieval performance achieved by different measures of association
is insignificant, providing that these are appropriately normalised. Intuitively,
one would expect this since most measures incorporate the same information.
Lerman has investigated the mathematical relationship between many
of the measures and has shown that many are monotone with respect to each
other. It follows that a cluster method depending only on the rank-ordering of
the association values would give identical clusterings for all these measures.


In Section 2.3, we described several ways to define this matrix. In this chapter, we will
establish theoretical bounds and present empirical evidence of the effect different similarity
measures have on the stability of regularization.

We will view regularization as the solution of a linear system. We rewrite the closed form
version of regularization (Equation 5.11) as,

$$\left(\frac{\alpha}{1-\alpha}\Delta + I\right) f^* = y \tag{7.1}$$

where the Laplacian, $\Delta$, is associated with the matrix, W, generated by some arbitrary
similarity measure (for example, cosine similarity).

We consider a matrix, $\tilde{W}$, generated by a different similarity measure (for example, Hellinger
similarity). The regularized scores using this alternative similarity measure are the solution to
the linear system in Equation 7.1 using $\tilde{W}$. Let $\tilde{\Delta}$ be the Laplacian of $\tilde{W}$. The linear system
for the perturbed matrix is,

$$\left(\frac{\alpha}{1-\alpha}\tilde{\Delta} + I\right) \tilde{f}^* = y \tag{7.2}$$
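Viewed as a linear system, regularized scores can be obtained with a single call to a linear solver. The sketch below assumes the symmetric normalized Laplacian of a symmetric affinity matrix and a parameter 0 ≤ α < 1; the function names and the toy data are ours and only illustrate the form of Equation 7.1.

```python
import numpy as np

def normalized_laplacian(W):
    """Return I - D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def regularize(W, y, alpha):
    """Solve (alpha / (1 - alpha) * Laplacian + I) f = y for f."""
    system = (alpha / (1.0 - alpha)) * normalized_laplacian(W) + np.eye(W.shape[0])
    return np.linalg.solve(system, y)

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2.0            # symmetric toy affinity matrix
y = rng.random(n)              # toy initial retrieval scores

f_star = regularize(W, y, alpha=0.5)
```

Setting α = 0 leaves the scores untouched, while values of α closer to 1 weight the Laplacian term more heavily and therefore smooth the scores more aggressively.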

We would like to bound the difference in regularized scores given differences in the
similarity matrix. We will measure the change in regularized scores using the relative error
between scores,

$$\frac{\|f^* - \tilde{f}^*\|_2}{\|\tilde{f}^*\|_2} \tag{7.3}$$

We will measure the difference in the similarity matrix according to the changes in the
associated Laplacians,

$$\|\Delta - \tilde{\Delta}\|_2$$

where the matrix norm is induced from the vector 2-norm.


Theorem 7.1.

$$\frac{\|f^* - \tilde{f}^*\|}{\|\tilde{f}^*\|} \le \frac{\alpha}{1-\alpha}\|\Delta - \tilde{\Delta}\| \tag{7.4}$$


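Proof. We will be treating the solution in Equation 7.2 as the solution to a perturbed version
of Equation 7.1 [Stewart and Sun, 1990]. To this end, we rewrite Equation 7.2 to show the
perturbation more explicitly,

$$(A + E)\tilde{f}^* = y$$

where

$$\begin{aligned}
A &= \frac{\alpha}{1-\alpha}\Delta + I \\
E &= \frac{\alpha}{1-\alpha}(\tilde{\Delta} - \Delta) \\
\tilde{A} &= A + E = \frac{\alpha}{1-\alpha}\tilde{\Delta} + I
\end{aligned}$$

The difference between solutions is then defined as,

$$\begin{aligned}
f^* - \tilde{f}^* &= A^{-1}y - \tilde{A}^{-1}y \\
&= (A^{-1} - \tilde{A}^{-1})y
\end{aligned}$$

Because A and $\tilde{A}$ are nonsingular,

$$\begin{aligned}
A^{-1}A &= I \\
A^{-1}A + A^{-1}E &= A^{-1}\tilde{A} \\
\tilde{A}^{-1} + A^{-1}E\tilde{A}^{-1} &= A^{-1} \\
A^{-1}E\tilde{A}^{-1} &= A^{-1} - \tilde{A}^{-1}
\end{aligned}$$

Therefore,

$$\begin{aligned}
f^* - \tilde{f}^* &= A^{-1}E\tilde{A}^{-1}y \\
&= A^{-1}E\tilde{f}^* \\
\|f^* - \tilde{f}^*\| &\le \|A^{-1}\|\|E\|\|\tilde{f}^*\| \\
\frac{\|f^* - \tilde{f}^*\|}{\|\tilde{f}^*\|} &\le \|A^{-1}\|\|E\|
\end{aligned} \tag{7.5}$$

Now, we compute the value of $\|A^{-1}\|$. First, we can show that A is positive definite.
That is, $x^T A x > 0$ for all $x \ne 0$. The proof is rather straightforward, using the fact that the
Laplacian is positive semidefinite [Chung, 1997],

$$\begin{aligned}
x^T A x &= x^T\left(\frac{\alpha}{1-\alpha}\Delta + I\right)x \\
&= \frac{\alpha}{1-\alpha}x^T\Delta x + x^T x \\
&\ge x^T x \\
&> 0
\end{aligned}$$

Given this, we know that, for positive definite matrices, $\|A^{-1}\| = \frac{1}{\lambda_{min}(A)}$ where $\lambda_{min}(A)$
is the minimum eigenvalue of A. We can derive the following relationship between the
eigenvalues of A and $\Delta$,

$$\begin{aligned}
Ax &= \lambda x \\
\left(\frac{\alpha}{1-\alpha}\Delta + I\right)x &= \lambda x \\
\frac{\alpha}{1-\alpha}\Delta x &= (\lambda - 1)x \\
\Delta x &= \frac{(\lambda - 1)(1 - \alpha)}{\alpha}x
\end{aligned}$$

The minimum eigenvalue of the Laplacian is 0 [Chung, 1997]. Therefore, the minimum
eigenvalue of A is 1 and $\|A^{-1}\| = 1$. Substituting this value, together with
$\|E\| = \frac{\alpha}{1-\alpha}\|\tilde{\Delta} - \Delta\|$, into Equation 7.5 completes our proof.

□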


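Remark 7.1. $\|\Delta - \tilde{\Delta}\| \le 2$

Proof. Because $\Delta$ and $\tilde{\Delta}$ are symmetric, by Fischer's theorem [Stewart and Sun, 1990,
Corollary IV.4.7], we can establish the following bound on the norm of their difference,

$$\begin{aligned}
\|\Delta - \tilde{\Delta}\| &= \lambda_{max}(\Delta - \tilde{\Delta}) \\
&\le \lambda_{max}(\Delta) + \lambda_{max}(-\tilde{\Delta}) \\
&= \lambda_{max}(\Delta) + 0 \\
&\le 2
\end{aligned}$$

□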
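The bound in Theorem 7.1 and the range in Remark 7.1 can be checked numerically on toy data. The sketch below assumes symmetric normalized Laplacians and uses two unrelated random affinity matrices as the "original" and "perturbed" similarity measures; all names and data are illustrative assumptions, not the experimental setup described in this chapter.

```python
import numpy as np

def normalized_laplacian(W):
    """Return I - D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def regularize(laplacian, y, alpha):
    """Solve (alpha / (1 - alpha) * laplacian + I) f = y for f."""
    system = (alpha / (1.0 - alpha)) * laplacian + np.eye(laplacian.shape[0])
    return np.linalg.solve(system, y)

def random_affinity(rng, n):
    """Toy symmetric affinity matrix with strictly positive entries."""
    M = rng.random((n, n))
    return (M + M.T) / 2.0

rng = np.random.default_rng(1)
n, alpha = 8, 0.3
lap = normalized_laplacian(random_affinity(rng, n))
lap_tilde = normalized_laplacian(random_affinity(rng, n))
y = rng.random(n)

f = regularize(lap, y, alpha)
f_tilde = regularize(lap_tilde, y, alpha)

relative_error = np.linalg.norm(f - f_tilde) / np.linalg.norm(f_tilde)
perturbation = np.linalg.norm(lap - lap_tilde, 2)   # induced (spectral) 2-norm
bound = alpha / (1.0 - alpha) * perturbation
```

On examples like this one the observed relative error tends to sit well below the bound, in line with the observation that a bound covering arbitrary perturbations is loose.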

We depict the bound on the relative error in Figure 7.1. The general behavior of this bound
confirms an intuition we might have already. For low values of α, when two affinity measures
are very similar, their regularized scores are very similar. In fact, for low values of α, the
regularization is quite robust to arbitrary differences in the affinity. However, as we
regularize more aggressively by using a higher α, the regularized solutions are more sensitive to
perturbations of the affinity matrix.

Figure 7.1. Bound on regularization error given similarity matrix perturbations and α. The
solid horizontal line represents the empirical mean perturbation found in our experiments.
The dashed lines represent one standard deviation. This graph is ideally viewed in color.

Figure 7.2. Empirical differences in regularized scores as a function of α for a retrieval
from our experiments. The dashed line in this graph represents the theoretical bound and
therefore is a cross-section of the surface from Figure 7.1.

The range of differences, $0 \le \|\Delta - \tilde{\Delta}\| \le 2$, includes arbitrary matrix perturbations.
In reality, our perturbations are likely to be constrained to differences much less than the
maximum.¹ In order to measure the empirical perturbations, we computed the differences
between Laplacians using cosine similarity and the Hellinger distance over all retrievals in
trec12/QL. We found that the mean value of $\|\Delta - \tilde{\Delta}\|$ was 0.541 ± 0.0585; this range is
depicted in Figure 7.1. An expected perturbation in this range indicates that regularized
scores will be very similar for 0 < α < 0.5. We plot the empirical regularization differences
for various α for a fixed query with $\|\Delta - \tilde{\Delta}\| = 0.510$ in Figure 7.2. From this figure, it should
be clear that our bound, because it considers arbitrary perturbations, is somewhat loose.
The empirical evidence from other queries indicates that the actual differences between
these two affinity measures are likely to be far below our bound.

¹ Because we are constraining our analysis to symmetric matrices perturbed by symmetric matrices, we might
believe that the bound is in fact much smaller. However, Higham proved that such constraints in fact do not
change the condition number and therefore we suspect that symmetric perturbation may not affect our bound
[Higham, 1995].

Figure 7.3. Empirical relationship between α and the Plantagenet coefficient.

The bound established in Theorem 7.1 measures the effect of perturbations on the norm of
the difference between the regularized scores. Because information retrieval is often
evaluated by the induced ranking, it is worth exploring the effect on rankings resulting from
perturbations. Therefore, for each pair of regularized rankings in our experiment, we compute
the Plantagenet coefficient of rank similarity [Genest and Plante, 2003]. The Plantagenet
coefficient is defined as,

$$\xi = -\frac{4n + 5}{n - 1} + \frac{6}{n^3 - n}\sum_{i=1}^{n} x_i y_i\left(4 - \frac{x_i + y_i}{n + 1}\right) \tag{7.6}$$


where x and y are vectors containing ranks of each document. The Plantagenet coefficient
is a version of Spearman correlation sensitive to changes in the top ranks. This measure is
appropriate because, when comparing two rankings, we are most concerned with changes
of the ranks of top-ranked documents. In Figure 7.3, we plot this correlation as a function of
α. This figure gives us an intuition for the perceptible changes resulting from using different
similarity measures. In fact, we see that, for all values of α, we achieve strong correlation
between rankings. This indicates that $\|\cdot\|$ gives us an accurate representation of the effect
of affinity perturbation.
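For concreteness, the following sketch computes a rank correlation of this kind in plain Python. It implements the symmetrized form of Blest's coefficient as we understand it from Genest and Plante, with ranks assumed to run from 1 to n without ties; the function name is ours and the code is an illustration rather than the implementation used in the experiments.

```python
def plantagenet(x, y):
    """Symmetrized Blest rank correlation between two rank vectors.

    x and y hold the ranks (1..n, no ties) assigned to the same n documents.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length n >= 2")
    total = sum(xi * yi * (4.0 - (xi + yi) / (n + 1)) for xi, yi in zip(x, y))
    return -(4.0 * n + 5.0) / (n - 1) + 6.0 * total / (n ** 3 - n)

ranks = [1, 2, 3, 4, 5]
same = plantagenet(ranks, ranks)             # identical rankings
reverse = plantagenet(ranks, ranks[::-1])    # completely reversed rankings
mixed = plantagenet(ranks, [2, 1, 3, 5, 4])  # two adjacent swaps
```

Identical rankings give a coefficient of 1 and fully reversed rankings give −1, so the measure is read on the same scale as Spearman correlation.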

Finally, we can also measure the effect of perturbation on differences in performance.
In Figure 7.4, we plot the relative differences in average precision resulting from changes
in the affinity measure. We notice that the mean relative change in average precision is less
than 10% for all values of α. This again indicates that regularization is, on average, not
sensitive to the similarity measure. Nevertheless, some of the outlying queries seem to be
more sensitive to α than the mean.

In summary, we have studied the stability of regularization subject to changes of the
parameter α. We found that, for small values of α, solutions are robust to small perturbations
in the similarity matrix. For more aggressive regularization, solutions are more sensitive to
perturbations in the similarity matrix. We complemented these theoretical results with
empirical measurements of the effect of similarity matrix perturbations. We found that the
differences between vector space model and language model similarities resulted only in slight
differences in regularized scores, rank ordering of regularized scores, and performance.

Figure 7.4. Empirical relationship between α and the relative change in average precision.
The performance using cosine similarity is used as the baseline.
CHAPTER 8


EXTENSIONS AND FUTURE WORK


We have presented a theoretical and experimental analysis of score regularization. In
this chapter, we will describe several extensions to the framework which demonstrate its
applicability to relevance feedback, cross-lingual retrieval, optimal set retrieval, and
cross-media retrieval.


8.1  Relevance  Feedback 


So far in this thesis, a system has been evaluated based on a single retrieval given a
short query. In some situations, the user also supplies sample relevant and non-relevant
documents. These judgments may be provided with the query or in response to some initial
query probing the collection for documents. The second scenario, referred to as relevance
feedback, will be the focus of this section.

All of the retrieval models in Section 2.2 have different methods for incorporating
relevance judgments in interactive retrieval. We will be focusing on the language modeling
approach. In Equation 2.12, when relevance judgments are absent, we estimate a language
model of relevance using a weighted combination of documents in an initial retrieval. When
document relevance information is provided, we can estimate the true relevance model directly
with binary weights [Lavrenko, 2004, p. 69],

$$P(w|\theta_R) = \lambda P(w|\theta_Q) + (1 - \lambda)\sum_{d \in \mathcal{R}^+}\frac{1}{|\mathcal{R}^+|}P(w|\theta_d)$$

where $\mathcal{R}^+$ is the set of documents judged relevant. In matrix notation, we represent this as

$$\tilde{q} = \lambda q + \frac{(1 - \lambda)}{\|r\|_1}C^T r \tag{8.1}$$

where r is an n × 1 vector where,

$$r_i = \begin{cases} 1 & \text{if } i \in \mathcal{R}^+ \\ 0 & \text{otherwise} \end{cases} \tag{8.2}$$

Documents are then ranked according to cross entropy,

$$f = \log(C)\tilde{q} \tag{8.3}$$
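The matrix form of the true relevance model is short enough to sketch directly. In the following NumPy example, the function names, the toy collection, and the assumption that each row of C is a smoothed (strictly positive) document language model are ours; the code mirrors Equations 8.1 to 8.3 only in form.

```python
import numpy as np

def feedback_query(q, C, r, lam):
    """Interpolate the query model with the mean model of the judged-relevant documents."""
    return lam * q + (1.0 - lam) / np.abs(r).sum() * (C.T @ r)

def cross_entropy_scores(C, query_model):
    """Rank documents by f = log(C) q."""
    return np.log(C) @ query_model

rng = np.random.default_rng(0)
n, m = 5, 7                           # toy sizes: 5 documents, 7 terms
C = rng.random((n, m)) + 1e-3         # strictly positive so the logarithm is defined
C /= C.sum(axis=1, keepdims=True)     # rows are document language models
q = np.zeros(m)
q[[0, 3]] = 0.5                       # original query model
r = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # documents 0 and 2 judged relevant

q_new = feedback_query(q, C, r, lam=0.6)
f = cross_entropy_scores(C, q_new)
```

Note that entries of r equal to 0 cover both unjudged and judged non-relevant documents, so explicit non-relevance judgments have no effect on the updated query model.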


We note that there is no formal model of non-relevance in relevance feedback based on
true relevance models. True relevance models approach information retrieval from the
perspective of density estimation. Relevant examples provide statistics for the true relevance
model. The non-relevance model, by default, is the language model estimated using
collection statistics, $P(w|\theta_C)$. As a naive approach to incorporating non-relevant documents, we
might add them to the documents used to model non-relevance. However, since the majority
of the collection is non-relevant, the information from additional non-relevant documents
would be washed out. This might be seen as a minor theoretical detail given the empirical
evidence that negative feedback does not result in significant improvements [Aalbersberg,
1992; Dunlop, 1997]. However, we believe that the information in explicitly non-relevant
documents can be useful in situations where no relevant documents are retrieved and the
system must filter non-relevant information. For example, if the only known keywords for a
topic retrieve a cohesive, non-relevant cluster, we would like to provide information to
remove the entire non-relevant cluster. We demonstrate this behavior in Figure 8.1. The
higher-scoring but non-relevant documents fall into one cluster while lower-scoring but
relevant documents fall into another cluster. Modeling non-relevance would allow the system
to effectively down-weight all documents in the non-relevant cluster.

Figure 8.1. A high-scoring non-relevant cluster: (a) initial score function, (b) relevant
documents. The figure on the left depicts the scores on the document graph. On the right,
we show the relevance of each document. Red nodes indicate relevant documents. Black
nodes indicate non-relevant documents.

Whereas true relevance modeling is a non-parametric density estimation method,
regularization is a non-parametric function approximation method. One advantage of
approaching this as function approximation is that we can explicitly model non-relevance
differently than we do uncertainty. Put another way, in Equation 8.2, a score of 0 represents
both non-relevant documents and unjudged documents. In regularization, if we normalize
scores to zero mean and unit variance, we can explicitly model relevant document scores
(eg, $y_i > 1$), non-relevant document scores (eg, $y_i < -1$), and unjudged documents (eg,
$y_i = 0$).

In  order  to  evaluate  a  relevance  feedback  method,  we  measure  the  performance  of  the 
system  after  receiving  feedback  on  the  top  k  documents.  We  evaluated  the  following  model. 
After  the  user  marks  the  top  k  documents  as  relevant  or  non-relevant,  we  re-issue  the  query 
using  the  true  relevance  model  approach  (Equation  8.1).  We  normalize  the  true  relevance 
model  scores,  y,  to  zero  mean  and  unit  variance.  For  each  relevant  document,  we  replace 
its  score  with  a  value  sampled  from  the  region  of  the  Gaussian  greater  than  the  maximum 
score.  We  do  the  same  replacement  for  each  non-relevant  document  by  using  samples  from 
the  bottom  region  of  the  Gaussian.  Given  these  adjusted  scores,  we  perform  our  standard 
regularization. 
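The score adjustment described above can be sketched as follows. Several details are our assumptions rather than statements of the evaluated system: the Gaussian is taken to be the standard normal matching the normalized scores, "the bottom region" is read as the region below the minimum normalized score, and the helper names and toy scores are hypothetical.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def sample_upper_tail(threshold, rng):
    """Draw from a standard normal conditioned on being at least `threshold`."""
    # By symmetry, sample the lower tail below -threshold and negate the result.
    tail_mass = _STD_NORMAL.cdf(-threshold)
    if tail_mass == 0.0:                      # tail too thin to represent; fall back
        return threshold
    u = tail_mass * (1.0 - rng.random())      # uniform on (0, tail_mass]
    return max(threshold, -_STD_NORMAL.inv_cdf(u))

def adjust_scores(scores, relevant, nonrelevant, seed=0):
    """Normalize scores, then overwrite the scores of judged documents."""
    rng = random.Random(seed)
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    z = [(s - mean) / std for s in scores]    # zero mean, unit variance
    top, bottom = max(z), min(z)
    adjusted = list(z)
    for i in relevant:                        # push judged-relevant documents above the maximum
        adjusted[i] = sample_upper_tail(top, rng)
    for i in nonrelevant:                     # push judged non-relevant documents below the minimum
        adjusted[i] = -sample_upper_tail(-bottom, rng)
    return adjusted

scores = [-4.1, -3.9, -5.0, -4.4, -6.2, -4.8, -5.5, -4.0]   # toy retrieval scores
relevant, nonrelevant = [0, 2], [1]
adjusted = adjust_scores(scores, relevant, nonrelevant)
```

The adjusted vector would then be passed to the standard regularization step in place of the original scores, so that the judgments propagate to neighboring unjudged documents.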

We  present  the  results  of  these  experiments  in  Figure  8.2.  Given  the  results  in  Chapter 
5,  we  should  not  be  surprised  that  regularization  consistently  improves  the  performance  of 
retrieval.  The  interesting  aspect  of  these  plots  is  that  the  amount  of  improvement  grows 
with  the  number  of  relevance  judgments.  We  suspect  that,  as  the  number  of  judgments 
increases,  the  regularization  component  of  the  system  becomes  more  important  because 
the  additional  data  introduces  a  more  discriminative  component  to  standard  true  relevance 
models, allowing us to take advantage of additional data [Ng and Jordan, 2002].

8.2  Cross-Lingual  Regularization 

Cross-lingual information retrieval refers to the task where a user is interested in documents
written in a foreign or target language and provides a query in her native or source
language. Traditional approaches to this problem usually perform some sort of query translation
from the source to the target language. In this section, we will describe a technique for
performing cross-lingual information retrieval without translating the query or performing
a second retrieval. We refer to this technique as cross-lingual score regularization.

Formally, we have a target collection of $n_t$ documents. Some small number, $n_s$, of the
target documents have been translated into the source language. Sets of translated collections
are common in the machine translation community and are referred to as parallel
corpora. We will further assume that, given a query in the source language, we have some
method for scoring the source language documents. For example, we might use one of the
methods from Section 2.2.

8.2.1  Cross-Lingual  Score  Regularization 

Cross-lingual regularization refers to the process of scoring the source parallel documents
and then assuming that the $n_s$ parallel target documents should have the same score. If the
user were interested in retrieving the parallel target documents, the retrieval process could
terminate at this stage. However, the user is more often interested in those target documents
which do not have source translations. We will score these non-parallel target documents by
using the score information from the parallel documents. We depict this process graphically
in Figure 8.3.¹

Figure 8.2. Relevance feedback results. The horizontal axis indicates the number of feedback
documents judged from the initial retrieval. The vertical axis plots mean average precision
of that retrieval. All regularized retrievals are rerankings of the true relevance model
runs.

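Assume that the translated documents are all indexed identically from 0 to $n_s$ for both
corpora and that we have an $n_t \times n_t$ affinity matrix for the target collection of documents.
Then the regularized target corpus scores are defined by the vector minimizing

$$\mathcal{Q}'(\mathbf{f}_t) = \mathcal{S}'(\mathbf{f}_t) + \mu\,\mathcal{E}'(\mathbf{f}_t, \mathbf{y}_s) \tag{8.4}$$

where $\mathbf{y}_s$ is the $n_s \times 1$ vector of source collection scores and $\mathbf{f}_t$ is the $n_t \times 1$ vector of regularized
target collection scores. The constraints are defined as,

$$\mathcal{S}'(\mathbf{f}_t) = \mathbf{f}_t^T \Delta_t \mathbf{f}_t \tag{8.5}$$

$$\mathcal{E}'(\mathbf{f}_t, \mathbf{y}_s) = \|\mathbf{f}_t - \mathbf{y}_t\|_2^2 \tag{8.6}$$

where $\mathbf{y}_t = [\mathbf{y}_s^T\ \mathbf{0}^T]^T$ is a vector of projected scores. This problem has similar solutions to
monolingual regularization. The iterative solution is,

$$\mathbf{f}_t^{(k+1)} = (1-\alpha)\,\mathbf{y}_t + \alpha\,\mathbf{S}_t\,\mathbf{f}_t^{(k)} \tag{8.7}$$

The closed form solution is,

$$\mathbf{f}_t^{*} = (1-\alpha)\left(\alpha\Delta_t + (1-\alpha)\mathbf{I}\right)^{-1}\mathbf{y}_t \tag{8.8}$$

where $\alpha = \frac{1}{1+\mu}$ and $\mathbf{S}_t = \mathbf{I} - \Delta_t$.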
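The iterative solution (Equation 8.7) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not our experimental code: the function names are hypothetical, the affinity matrix is taken to be a dense symmetric list of lists in which every document has at least one neighbor, and the smoothing matrix is assumed to be the degree-normalized affinity matrix $D^{-1/2} W D^{-1/2}$.

```python
import math


def normalized_affinity(W):
    """S = D^{-1/2} W D^{-1/2}, where D holds the row sums of W (assumed > 0)."""
    deg = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def cross_lingual_regularize(W_t, y_s, alpha=0.5, iterations=200):
    """Iterate f <- (1 - alpha) * y_t + alpha * S f over the target collection.

    W_t: n_t x n_t target affinity matrix; the first len(y_s) rows are the
         parallel documents, indexed identically in both corpora.
    y_s: scores of the parallel documents computed in the source language.
    """
    n_t = len(W_t)
    y_t = list(y_s) + [0.0] * (n_t - len(y_s))   # projected scores [y_s; 0]
    S = normalized_affinity(W_t)
    f = list(y_t)
    for _ in range(iterations):
        f = [(1 - alpha) * y_t[i]
             + alpha * sum(S[i][j] * f[j] for j in range(n_t))
             for i in range(n_t)]
    return f
```

For example, with two parallel documents scored 1.0 and 0.1, a non-parallel document whose strongest neighbor is the first parallel document receives a higher regularized score than one whose strongest neighbor is the second.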

8.2.2  Cross-Lingual  Relevance  Models 

Let $\theta^t$ refer to a language model over the target vocabulary; similarly, $\theta^s$ models the
source language. Given a query in the source language, we score each source document,
$d$, according to the query likelihood, $P(Q|\theta_d^s)$. The cross-lingual relevance model is estimated
as,

$$P(w|\theta_Q^t) = \sum_{d} \frac{P(Q|\theta_d^s)}{\sum_{d'} P(Q|\theta_{d'}^s)}\,P(w|\theta_d^t) \tag{8.9}$$

The difference between the cross-lingual relevance model and the standard relevance model
(Equation 2.12) is that we are applying the score for a source document to the parallel target
document. This lets us build a relevance model in the target language using source document
scores as the interpolation weights, solving our problem of not having a query in the
target language. In matrix notation,

$$\mathbf{q}_t = \frac{\mathbf{C}_t^T \mathbf{y}_s}{\|\mathbf{y}_s\|_1}$$

where $\mathbf{C}_t$ is our target collection and $\mathbf{y}_s$ is an $n_t \times 1$ vector where the $n_s$ documents with
translations are scored by $P(Q|\theta_d^s)$ and the $n_t - n_s$ target-only documents receive a score
of 0. We can now use a cross-entropy scoring function to rank target documents,

$$\mathbf{f}_t = \log(\mathbf{C}_t)\,\mathbf{q}_t \tag{8.10}$$
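The cross-lingual relevance model ranking can be sketched as below. This is a hedged illustration only: the function name is hypothetical, and the rows of the collection matrix are assumed to be smoothed target language models with strictly positive entries so that the logarithm is defined.

```python
import math


def cross_lingual_relevance_scores(C_t, y_s):
    """Score target documents with a cross-lingual relevance model.

    C_t: n_t x m matrix (list of rows); row d is the smoothed target language
         model of document d, with every entry positive.
    y_s: length n_t list; the source query likelihood for documents that have
         a translation, and 0 for target-only documents.
    """
    n_t, m = len(C_t), len(C_t[0])
    total = sum(y_s)                                   # L1 norm of y_s
    # Relevance model over the target vocabulary: q_t = C_t^T y_s / ||y_s||_1
    q_t = [sum(C_t[d][w] * y_s[d] for d in range(n_t)) / total
           for w in range(m)]
    # Cross-entropy scoring: f_t = log(C_t) q_t
    return [sum(math.log(C_t[d][w]) * q_t[w] for w in range(m))
            for d in range(n_t)]
```

A target-only document whose language model resembles that of a highly scored parallel document receives a higher score than one resembling a poorly scored parallel document.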

¹ In the context of cross-lingual link detection, we used similar techniques successfully [Diaz and Metzler,
2007].

Figure 8.3. Cross-collection regularization by score projection. Documents in the parallel
corpus are represented as brown circles. Documents of interest in the target language
are represented as white circles. Bold numbers represent scores of the parallel documents.
Unbolded numbers represent interpolated scores.

recall    CLRM      CLSR
0.00      0.5694    0.6238
0.10      0.3737    0.4456
0.20      0.3194    0.3535
0.30      0.2789    0.2943
0.40      0.2424    0.2502
0.50      0.2049    0.2010
0.60      0.1673    0.1520
0.70      0.1301    0.0989
0.80      0.0916    0.0536
0.90      0.0361    0.0154
1.00      0.0000    0.0000
map       0.2027    0.2064

Table 8.1. Cross-lingual relevance models compared to cross-lingual score projection.


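Theorem 8.1. Cross-lingual relevance models are a form of cross-lingual regularization.

Proof. The proof follows that of Theorem 6.4. Starting from the ranking function,

$$\begin{aligned}
\mathbf{f}_t &= \log(\mathbf{C}_t)\,\mathbf{q}_t \\
&= \log(\mathbf{C}_t)\,\frac{\mathbf{C}_t^T \mathbf{y}_s}{\|\mathbf{y}_s\|_1} \\
&\propto \mathbf{A}_t \mathbf{y}_s
\end{aligned} \tag{8.11}$$

where $\mathbf{A}_t$ is an $n_t \times n_t$ affinity matrix based on inter-document cross-entropy between the
target documents. □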

8.2.3  Experiments 

We compared the performance of cross-lingual score regularization (CLSR) to cross-lingual
relevance models (CLRM) using a cross-lingual retrieval task involving a source
query written in English and a target collection written in Mandarin [Smeaton, 1996].

We present results in Table 8.1. Perhaps due to the similarity of the approaches, there is
no statistically significant difference between CLRM and CLSR. Upon investigation of the
results, however, we notice that CLSR tends to perform better at the top of the ranked list
(recall levels up to 0.40) while CLRM performs better at higher recall levels. To explore
this further, we can look at the performance for high precision measures. In Table 8.2, we
evaluate each algorithm by the precision for the top k documents. Although mostly statistically
similar, CLSR performs significantly better when considering the top 10 documents.

P@K    CLRM      CLSR
5      0.3556    0.4185
10     0.3167    0.4037
15     0.3123    0.3617
20     0.3102    0.3389
30     0.3006    0.3228

Table 8.2. Cross-lingual relevance models compared to cross-lingual score projection.


8.3  Future  Work 

8.3.1  Optimal  Cluster  Retrieval 

Sometimes the user is interested in being provided a high precision set of documents
instead of a ranking of the entire collection [Dai and Srihari, 2005; Griffiths et al., 1986;
Hearst and Pedersen, 1996; Jardine and van Rijsbergen, 1971; Liu and Croft, 2006]. This is
important when the information retrieval user is an automatic process such as a text
summarization system. Previous approaches to this task attempt to detect a single, tight cluster in an
initial retrieval. An alternative approach, suggested by our score regularization framework,
treats this as an optimization problem.

In regularization, we are concerned with document scores which induce a ranking; that
is, $\mathbf{f} \in \mathbb{R}^n$. In optimal cluster retrieval, we are concerned with document scores which
induce a partition; that is, $\mathbf{f} \in \{0, 1\}^n$ where documents with a score of 1 are retrieved. The
principal objective of optimal cluster retrieval is that the retrieved set be on the same topic.
One way of measuring this property is to inspect the local relationships in the set,

$$\frac{\mathbf{f}^T \mathbf{W} \mathbf{f}}{\|\mathbf{f}\|_1} = \frac{\sum_{i,j} W_{ij} f_i f_j}{\sum_i f_i}$$

When the value of this objective is small, documents in the set are unrelated to each other;
when it is large, the documents in the set have high similarity. Notice that this is equal to the
Moran autocorrelation of $\mathbf{f}$ (Equation 4.5). Although the similarity of documents within the
set is important, we might alternatively be interested in the retrieved cluster being dissimilar
from the rest of the corpus. We can measure this with,

$$\frac{\mathbf{f}^T \Delta_C \mathbf{f}}{\|\mathbf{f}\|_2^2} \propto \frac{\sum_{i,j} W_{ij} (f_i - f_j)^2}{\sum_i f_i^2}$$

which is equivalent to a graph min-cut objective or, in spatial data analysis, the Geary
autocorrelation [Cliff and Ord, 1973].
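Both graph-based objectives are simple to evaluate for a candidate set. The following sketch is illustrative only; the function names are hypothetical and the affinity matrix is assumed to be a dense symmetric list of lists.

```python
def within_set_affinity(W, f):
    """sum_ij W_ij f_i f_j / sum_i f_i for a binary indicator vector f."""
    n = len(f)
    numerator = sum(W[i][j] * f[i] * f[j] for i in range(n) for j in range(n))
    return numerator / sum(f)


def boundary_cut(W, f):
    """sum_ij W_ij (f_i - f_j)^2 / sum_i f_i^2 for a binary indicator vector f.

    Only pairs with one document inside the set and one outside contribute.
    """
    n = len(f)
    numerator = sum(W[i][j] * (f[i] - f[j]) ** 2
                    for i in range(n) for j in range(n))
    return numerator / sum(x * x for x in f)
```

For a graph with two tight clusters, selecting one whole cluster gives a high within-set affinity and a small cut, while selecting one document from each cluster gives the opposite.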

Unfortunately, these purely graph-based objectives ignore the relevance of the documents,
$\mathbf{y}$, potentially resulting in retrieval of clusters of documents which are non-relevant.
In Figure 8.4, we present a situation where ignoring the document scores may lead to the
selection of low-scoring documents. This figure also demonstrates that the optimal set may
consist of documents from a portion of a cluster. In order to address this we can develop
additional constraints on $\mathbf{f}$ that ensure that we select high-scoring documents. The easiest
constraint would simply select a subset with a high average score,

$$\frac{\mathbf{f}^T \mathbf{y}}{\|\mathbf{f}\|_1} = \frac{\sum_i f_i y_i}{\sum_i f_i}$$

Figure 8.4. Retrieval function for the query “nuclear proliferation”. This is a function
which is not consistent with the topology of the document graph.

Alternatively, we could consider other ways of incorporating the scores, such as the geometric
mean or variance. One particularly interesting measure would be the smoothness along the
set boundary,

$$\sum_{\{i\,:\,f_i = 1\}}\ \sum_{\{j\,:\,f_j = 0\}} W_{ij}\,(y_i - y_j)^2$$

This objective would, to some extent, detect local “patches” of relevant documents nested
in some larger cluster.
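The two score-aware measures can be sketched as follows. The function names are hypothetical and the code is only meant to make the definitions concrete.

```python
def average_set_score(f, y):
    """f^T y / ||f||_1: mean initial score of the selected documents."""
    return sum(fi * yi for fi, yi in zip(f, y)) / sum(f)


def boundary_smoothness(W, f, y):
    """Sum of W_ij (y_i - y_j)^2 over selected i (f_i = 1) and unselected j (f_j = 0)."""
    inside = [i for i, fi in enumerate(f) if fi == 1]
    outside = [j for j, fj in enumerate(f) if fj == 0]
    return sum(W[i][j] * (y[i] - y[j]) ** 2 for i in inside for j in outside)
```

Both functions take the candidate set as a binary indicator vector, so they can be combined with the graph-based objectives when comparing candidate sets.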

Selecting and combining these objectives is not trivial. Although very similar to the
isoperimeter problem,² the fact that our documents have scores associated with them makes
the problem a more difficult boolean programming task. Semidefinite relaxations of this
problem may provide good approximate solutions [Vandenberghe and Boyd, 1996; Poljak
et al., 1995].

8.3.2  Incorporating  Regularization  into  Formal  Models 

Local score regularization is presented as a fix for existing retrieval methods which ignore
local consistency. One of the primary goals of this thesis is to prompt the introduction of
regularity as a design principle for new retrieval systems and models. We have demonstrated
that, for some retrieval scenarios, local consistency can significantly improve performance.
We believe that the direct introduction of local consistency in formal models can result in much
stronger improvements. As discussed in Chapter 6, pseudo-relevance feedback captures this
to a certain extent. Finding approaches which model local consistency using more than a
single iteration of regularization is a compelling problem and a worthwhile research direction.

² The smallest-enclosing hypersphere problem for continuous spaces and the isoperimetric set problem for graphs
refer to the task of finding point sets of maximum support ([Scholkopf et al., 2001], [Chung et al., 2000], and
[Grady and Schwartz, 2006] provide starting points for this literature).

8.3.3  Cross-Media  Regularization 

In the same way we projected document scores across languages, we can consider projecting
scores across media. Cross-media information retrieval refers to the scenario where
the user poses a query in one medium, for example text, and expects results in some other
medium, for example images. In cross-media regularization, we treat the collections in Figure 8.3
as coming from different media. However, we need to ask ourselves two questions. First, do
we have a meaningful parallel corpus? The answer to this question depends on the task but
in some situations, we can reply affirmatively. For example, images and movies often can be
associated with explicit keywords or, when available, the text in which they are situated. Second,
we must ask whether an appropriate affinity measure exists in the target corpus. Certainly
we have presented substantial evidence that, for text, content-based similarity measures are
appropriate for topic-based retrieval. It is less clear that we can make similar claims about
other media. We have some evidence that appropriate similarity measures exist for some
images but, in general, this is an open research question [He et al., 2004; Shi and Malik,
2000].

8.3.4  Diffusion  Wavelets 

We mentioned in passing that regularization using lower order harmonic functions on
the graph did not improve performance as much as the local approach we take. Although we
can argue that the peaked nature of the retrieval function precludes harmonic reconstruction,
this does not imply that our local approach is necessarily the best approach. For example,
multi-scale analysis and diffusion wavelets [Bremer et al., 2004; Coifman and Maggioni,
2004; Szlam et al., 2006] would provide a middle ground between regularization based on
global analysis and regularization based purely on local analysis.

8.3.5  Robust  Regularization 

One of the assumptions underlying regularization is that all initial retrieval scores are
equally valid. This is represented in the error cost,

$$\mathcal{E}(\mathbf{f}, \mathbf{y}) = \sum_{i=1}^{n} (f_i - y_i)^2$$

Retrieval algorithms rarely are equally confident about document scores. A system is often
more confident about scores of high-scoring documents than low-scoring documents.
Unfortunately, our constraint is ignorant of these subtleties. In reality what we would like is for
our constraint to penalize inconsistencies with high-scoring documents more than inconsistencies
with low-scoring documents. One way to introduce this adaptive weight is to define
a new error cost,

$$\mathcal{E}_g(\mathbf{f}, \mathbf{y}) = \sum_{i=1}^{n} g(i)\,(f_i - y_i)^2 \tag{8.12}$$

where $g$ is a monotonically decreasing function of the rank of the document $i$; lower-ranked
documents contribute less to the cost. This type of adaptive weighting would result in a
regularization in which low-scoring documents related to high-scoring documents would
bubble up the ranked list without allowing high-scoring documents to be weighed down by
low-scoring neighbors.
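As a concrete illustration of Equation 8.12, the sketch below evaluates the rank-weighted error cost. The default weight $g(\text{rank}) = 1/\text{rank}$ is purely an assumed example of a monotonically decreasing function; we do not prescribe a particular $g$, and the function name is hypothetical.

```python
def rank_weighted_error(f, y, g=lambda rank: 1.0 / rank):
    """E_g(f, y) = sum_i g(rank_i) * (f_i - y_i)^2.

    rank_i is the rank of document i under the initial scores y, with rank 1
    assigned to the highest-scoring document. The default g is illustrative.
    """
    order = sorted(range(len(y)), key=lambda i: y[i], reverse=True)
    rank = {doc: r for r, doc in enumerate(order, start=1)}
    return sum(g(rank[i]) * (f[i] - y[i]) ** 2 for i in range(len(y)))
```

With this weighting, moving the score of the top-ranked document by one unit costs more than moving the score of the bottom-ranked document by the same amount, whereas the unweighted cost treats both changes identically.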
CHAPTER 9


CONCLUSIONS 


We began this thesis by describing the cluster hypothesis as a design principle for information
retrieval systems. In the course of this dissertation, we have developed methods
for measuring the satisfaction of this principle (Chapters 3 and 4), demonstrating that local
consistency correlates positively with system performance. In Chapter 5, we moved from
this correlation result to provide evidence of a causal relationship between local consistency
and performance. Our technical contributions to information retrieval include

1. A formal measure of local consistency. We derived autocorrelation directly
from the Voorhees test, allowing it to be used to measure the degree to which a system
satisfies the cluster hypothesis.

2. A demonstration of the correlation between local consistency and performance.
We presented empirical evidence that shows the relationship between local
consistency and performance.

3. A demonstration that improving the local score consistency of a system
using a Laplacian-based approach improves performance. We presented an
algorithm which used the graph Laplacian to improve the local consistency of retrieval
score functions. This demonstrated a causal relationship between local consistency
and performance.

4. A regularization-based perspective on pseudo-relevance feedback. We presented
an extended discussion of the relationship between regularization and previous
research, concluding that some of the success of these methods may be explained by
their effect on local score consistency.

These technical contributions all advance the understanding of the cluster hypothesis in
information retrieval.

In addition to these theoretical contributions, our work has resulted in several practical
contributions to information retrieval,

1. A novel precision prediction method. We developed a new precision
predictor which is competitive with state-of-the-art precision predictors and improves
performance when used in combination with these previous approaches.

2. A consistently beneficial document re-ranking algorithm. We described a
new method based on the graph Laplacian for re-ranking documents based on improving
local score consistency. We demonstrated that this algorithm is generally
applicable and easily extendable into new domains.

These two contributions have been rigorously tested across a diverse set of retrieval scenarios.

We believe this concluding chapter is the most appropriate place to nestle a few editorial
comments about local consistency and feature-based retrieval models. Feature-based
models allow the designer to use sophisticated machine learning techniques to select parameter
values which optimize performance for retrieval evaluation measures [Metzler and
Croft, 2005; Yue et al., 2007]. The resulting complex combination of features often significantly
improves performance. Unfortunately, to date, these approaches lack one fundamental
property of term-based models: the ability to model local consistency. Similarity in term
space tends to imply topical relatedness. Similarity in feature space does not necessarily
imply topical relatedness. If documents are only ever represented by abstract features, then
computing topical relationships is very difficult. We believe that these models should formally
and directly incorporate topical regularity objectives just as they do other features and
hopefully in a manner more elaborate than a single iteration of regularization. The beauty
of these models is in their ability to automatically combine well-known design principles
from information retrieval. The cluster hypothesis should not be excluded.

That said, we should be explicit about the limitations of our work. First, the improvements
garnered by regularization were most visible at higher recall points. If we are building
a system for a high precision task such as web search, then we will only see marginal gains
from regularization. However, there are many retrieval tasks for which all recall points are
important; these include legal search and medical search. Second, regularization requires a
meaningful affinity matrix, W, where we expect labels or scores of related documents to be
similar. There are certainly tasks where affinity measures are noisier (for example, sentence
affinity) or not related to scores at all (for example, diversity-based ranking or novelty).
Finally, some of the methods in this work are admittedly intended to obsolesce. We aim to
provoke the incorporation of regularization terms into existing and future retrieval models.
If this dissertation is successful, systems will produce locally consistent scores, preventing
prediction by autocorrelation or improvement by post-hoc regularization.
APPENDIX A


EXPERIMENTAL  DATA 


A.1 Data for Detailed Experiments

A.1.1 Topics

We performed experiments on two data sets. The first data set, which we will call
“trec12”, consists of the 150 TREC Ad Hoc topics 51-200. We used only the news collections
on Tipster disks 1 and 2 [Harman, 1993]. The second data set, which we will call
“robust”, consists of the 250 TREC 2004 Robust topics [Voorhees, 2004]. We used only
the news collections on TREC disks 4 and 5. The robust topics are considered to be difficult
and have been constructed to focus on topics which systems usually perform poorly on.
For both data sets, we use only the topic title field as the query. The topic title is a short
keyword query associated with each TREC topic. We indexed collections using the Indri
retrieval system, the Rainbow stop word list, and Krovetz stemming [Strohman et al., 2004;
McCallum, 1996; Krovetz, 1993].

A.1.2 Baselines

For these detailed experiments, we sought baselines which were strong, in the sense of
high performance, and realistic, in the sense of not over-fitting. Therefore, we first performed
cross-validation to construct baseline retrieval scores. We report the specifics of
these experiments in the subsequent sections. We describe our experimental data in Section
A.1.1 and baseline algorithms in Sections 2.2.1-2.2.3. We present parameters for our baseline
algorithms in Table A.2. We also present trained parameter values (or ranges if they
were different across partitions).¹

¹ Our Markov random field baseline system uses a structured query model which incorporates inter-term
dependencies [Metzler and Croft, 2005]. We use the Indri query language to implement full dependence
models with fixed parameters of $(\lambda_T, \lambda_O, \lambda_U) = (0.8, 0.1, 0.1)$ as suggested by the authors [Metzler and

collection   number of documents   queries            comments
trec12       741,856               51-200             Tipster disks 1 and 2 without government documents
robust       472,525               301-450,650-700    TREC disks 4 and 5 without government documents

Table A.1. Topics and corpora used in detailed experiments.

                                                       optimal
parameter               range                        trec12       robust
Okapi
  b                     [0.1-1.0; 0.1]               0.3          0.3
  k                     [0.5-2.5; 0.25]              1.5-2.0      0.75
Query Likelihood
  μ                     [500-4000; 500]              2000         1000
Relevance Models
  r                     {5, 25, 50, 100}             25-50        5-25
  m̃                     {5, 10, 25, 50, 75, 100}     100          75-100
  λ                     [0.1-0.7; 0.1]               0.2          0.1-0.2
Markov Random Field
  μ_text                [500-4000; 500]              500-1500     3000-4000
  μ_window              [500-4000; 500]              500-2000     500

Table A.2. Parameter sweep values. Parameter ranges considered in the cross-validation.
For each topic set, we present the optimal parameter values selected during training. When
these values were not stable across partitions, we present the optimal parameter ranges.


A.2 Data for Generalizability Experiments

A.2.1  Collections  and  Runs 

In addition to our detailed experiments, we were interested in evaluating the generalizability
of score regularization to arbitrary initial retrieval algorithms. To this end, we
collected the document rankings for all automatic runs submitted to the Ad Hoc Retrieval
track for TRECs 3-8, Robust 2003-2005, Terabyte 2004-2005, TRECs 3-4 Spanish, and
TRECs 5-6 Chinese [Voorhees and Harman, 2001]. This constitutes a variety of runs and
tasks with varying levels of performance. In all cases, we use the appropriate evaluation
corpora, not just the news portions as in the detailed experiments. We also include results
for the TREC 2005 Enterprise track Entity Retrieval subtask. This subtask deals with the
modeling and retrieval of entities mentioned in an enterprise corpus consisting of email and
webpages. Although all sites participating in TREC include a score in run submissions, we
cannot be confident about the accuracy of the scores. Therefore, inconsistent behavior for
some runs may be the result of inaccurate scores. We present statistics for these collections
and runs in Table A.3.

Croft, 2005]. We focus training for the Markov random field on the feature parameters governing smoothing.
See [Metzler and Croft, 2005] for a more detailed description of these parameters.

collection       number of documents   queries              number of runs
trec3            741,856               151-200              28
trec4            567,529               201-250              14
trec5            524,929               251-300              30
trec6            556,077               301-350              56
trec7            528,155               351-400              86
trec8            528,155               401-450              116
robust03         528,155               601-650, robust03    76
robust04         528,155               301-450, 651-700     80
robust05         1,033,461             robust05             59
terabyte04       25,205,179            701-750              70
terabyte05       25,205,179            751-800              54
trec4-spanish    57,868                26-50                21
trec5-spanish    230,820               26-50                18
trec5-chinese    164,779               1-28                 20
trec6-chinese    164,779               29-54                28
ent05            1,092                 1-50                 37

Table A.3. Topics, corpora, and runs used in generalizability experiments.


A.2.2  Affinity  Matrices 

We used the cosine similarity described in Section 2.3. Non-English collections received
no linguistic processing: tokens were broken on whitespace for Spanish and single characters
were used for Chinese. Entity similarity is determined by the co-occurrence of entity names
in the corpus.
$\mathbf{A}$            matrix
$\mathbf{A}_i$          the $i$th matrix
$A_{ij}$                element $(i, j)$ of matrix $\mathbf{A}$
$\mathbf{a}$            vector
$\mathbf{a}_i$          the $i$th vector
$a_i$                   element $i$ of vector $\mathbf{a}$
$a$                     scalar
$f(\mathbf{A})$         element-wise function of $\mathbf{A}$
$\mathbf{A}^{1/2}$      element-wise square root
$\mathbf{A}^{-1}$       matrix inverse
$\mathbf{A}^{T}$        matrix transpose
$\|\mathbf{a}\|_1$      $L_1$ norm of the vector $\mathbf{a}$


Table B.1. Notational convention for vector and matrix representation.




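$n$                 number of documents
$\tilde{n}$         number of documents to regularize
$m$                 number of terms
$\mathbf{C}$        $n \times m$ collection matrix
$\mathbf{d}_i$      row $i$ of $\mathbf{C}$ as a column vector
$\mathbf{w}_i$      column $i$ of $\mathbf{C}$
$\mathbf{l}$        $n \times 1$ vector of document lengths
$\mathbf{c}$        $m \times 1$ vector of term document frequencies
$\mathbf{A}$        $n \times n$ document affinity matrix
$\mathbf{W}$        nearest neighbor graph based on $\mathbf{A}$
$\mathbf{y}$        $n \times 1$ initial score vector
$\mathbf{f}$        $n \times 1$ regularized score vector
$\mathbf{U}$        $m \times k$ matrix of cluster vectors
$\mathbf{V}$        $k \times n$ matrix of documents embedded into $k$ dimensions
$\mathbf{y}_c$      $k \times 1$ cluster score vector
$\mathbf{W}_e$      $n \times n$ graph based on expanded documents
$\mathbf{y}_e$      $n \times 1$ vector of scores for expanded documents
$\Delta$            $n \times n$ Laplacian on $\mathbf{W}$
$\mathbf{E}_k$      $n \times k$ matrix of top $k$ eigenvectors of $\mathbf{W}$
$\mathbf{e}$        column vector of all 1's
$\mathbf{I}$        identity matrix


Table B.2. Definition of Symbols.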
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APPENDIX  C 


EVALUATION 


C.1 Metrics

Although the goal of information retrieval is a classification of all documents as relevant
or non-relevant, the highly skewed class distribution requires the adoption of rank-based
measures. In all experiments we will be measuring performance using mean average precision
and interpolated precision at standard recall levels. Following convention, we will be selecting
parameters to optimize mean average precision.

For all experiments, we evaluate the top 1000 documents retrieved. Let this ranked
set of documents be defined as the vector $\mathbf{p}_q$ for a particular query, $q$, such that $p_q[i]$ is the
relevance judgment of the document at rank $i$. That is, $p_q[i] = 1$ if the $i$th ranked document
is relevant and $p_q[i] = 0$ otherwise. We often want to evaluate the quality of the ranking
after a user has observed $k$ documents in the ranking. Precision after $k$ documents is defined
as,

$$\mathcal{P}_k(\mathbf{p}_q) = \frac{1}{k}\sum_{i=1}^{k} p_q[i] \tag{C.1}$$

Recall after $k$ documents is defined as,

$$\mathcal{R}_k(\mathbf{p}_q) = \frac{1}{|R_q|}\sum_{i=1}^{k} p_q[i] \tag{C.2}$$

where $|R_q|$ is the number of relevant documents for the query.
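Precision and recall after $k$ documents translate directly into code. The sketch below is a minimal illustration with hypothetical function names; the ranking is a list of 0/1 relevance judgments ordered by rank.

```python
def precision_at_k(p_q, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(p_q[:k]) / k


def recall_at_k(p_q, k, num_relevant):
    """Fraction of the num_relevant relevant documents found in the top k."""
    return sum(p_q[:k]) / num_relevant
```

For a ranking with judgments [1, 0, 1, 0, 0] and four relevant documents in the collection, precision after three documents is 2/3 and recall after three documents is 1/2.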

C.1.1 Mean Average Precision

The average precision for a query is defined as,

$$\bar{\mathcal{P}}(\mathbf{p}_q) = \frac{1}{|R_q|}\sum_{k=1}^{N} p_q[k]\,\mathcal{P}_k(\mathbf{p}_q) \tag{C.3}$$

where $N$ is the total number of documents retrieved. We can combine the average precision
for a set of queries by using the arithmetic mean,

$$\mathrm{map} = \frac{1}{|Q|}\sum_{q \in Q} \bar{\mathcal{P}}(\mathbf{p}_q) \tag{C.4}$$

where $Q$ is our set of queries. We refer to this as the mean average precision (map) of a
particular algorithm for a set of queries.
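Average precision and its mean over queries can be sketched as follows. The function names and the (ranking, number of relevant documents) input format are illustrative assumptions.

```python
def average_precision(p_q, num_relevant):
    """(1 / |R_q|) * sum over ranks k of p_q[k] * (precision after k documents)."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(p_q, start=1):
        if relevant:
            hits += 1
            total += hits / k      # precision at the rank of each relevant document
    return total / num_relevant


def mean_average_precision(runs):
    """runs: list of (p_q, num_relevant) pairs, one pair per query."""
    return sum(average_precision(p, r) for p, r in runs) / len(runs)
```

Relevant documents that are never retrieved contribute nothing to the sum but still count in the denominator, so missing them lowers the average precision.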


C.1.2 Interpolated Precision at Standard Recall Levels

Although favoring systems which place relevant documents at the beginning of the ranked
list, the mean average precision does not provide a good indication of performance at particular
recall levels. We would like to say, for example, that a particular method demonstrates
high precision near the top of the ranked list as opposed to closer to the bottom.

A finer-grained method for measuring ranked list performance is to use the interpolated
precision at specific recall levels. The precision at a recall level $x$ refers to the precision value
of $\mathcal{P}_k(\mathbf{p}_q)$ where $\mathcal{R}_k(\mathbf{p}_q) = x$. For a particular query, the recall function jumps in increments
of $\frac{1}{|R_q|}$. Therefore, we must interpolate the precision using the sampled recall points. It is
common practice in information retrieval to define the interpolated precision at recall level
$x$ as,

$$\mathcal{P}_x(\mathbf{p}_q) = \sup\{\mathcal{P}_k(\mathbf{p}_q) : \mathcal{R}_k(\mathbf{p}_q) \geq x\} \tag{C.5}$$

In practice, this results in a monotonically decreasing step function. We present the interpolated
precision graphically as the colored functions in Figure C.1. The interpolated
precision at recall level $x$ for a set of queries averages these values,

$$\mathcal{P}_x = \frac{1}{|Q|}\sum_{q \in Q} \mathcal{P}_x(\mathbf{p}_q) \tag{C.6}$$

where $Q$ is our set of queries. This is shown graphically in Figure C.1 as the black line. We
use the convention of computing $\mathcal{P}_x$ at the recall levels,

{0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00}.


C.2  Cross  Validation 

Whenever parameters needed tuning, we performed 10-fold cross-validation. We adopt
Platt's cross-validation procedure for training and evaluation [Platt, 2000]. We first
randomly partition the queries for a particular collection into ten partitions. For each
partition, i, the algorithm is trained on all but that partition and is evaluated using that
partition, i. For example, if the training phase considers the topics and judgments in
partitions 1-9, then the testing phase uses the optimal parameters for partitions 1-9 to
perform retrieval using the topics in partition 10. Using each of the ten possible training
sets of size nine, we generate unique evaluation rankings for each of the topics over all
partitions. Evaluation and comparison were performed using the union of these ranked lists.
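The procedure above can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used for the experiments: `tune` and `retrieve` are hypothetical callables standing in for parameter selection on the training queries and for retrieval with the selected parameters, and the fixed random seed is only there to make the partition reproducible.

```python
import random
from typing import Any, Callable, Dict, Hashable, Iterable, List


def partition_queries(query_ids: Iterable[Hashable], num_folds: int = 10,
                      seed: int = 0) -> List[List[Hashable]]:
    """Randomly split the queries into num_folds disjoint partitions."""
    shuffled = list(query_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_folds] for i in range(num_folds)]


def cross_validate(query_ids: Iterable[Hashable],
                   tune: Callable[[List[Hashable]], Any],
                   retrieve: Callable[[Hashable, Any], Any],
                   num_folds: int = 10, seed: int = 0) -> Dict[Hashable, Any]:
    """Return one ranking per query, each produced with parameters tuned
    on the partitions that do not contain that query."""
    folds = partition_queries(query_ids, num_folds, seed)
    rankings: Dict[Hashable, Any] = {}
    for i, held_out in enumerate(folds):
        # Train on every partition except partition i.
        training = [q for j, fold in enumerate(folds) if j != i for q in fold]
        params = tune(training)
        # Evaluate on partition i only.
        for q in held_out:
            rankings[q] = retrieve(q, params)
    # The union of the held-out rankings covers every query exactly once.
    return rankings
```

Because every query is held out exactly once, the returned dictionary is the union of ranked lists on which evaluation and comparison are performed.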
Figure C.1. Averaging interpolated precision curves. The interpolated precision curves for two
queries are shown in color. The average interpolated precision can be computed at standard recall
levels and is depicted as the solid black line.

trec12/okapi

recall       0     100     250     500     750    1000
0.00    0.7430  0.7508  0.7521  0.7473  0.7485  0.7576
0.10    0.5086  0.5188  0.5180  0.5211  0.5204  0.5245
0.20    0.4308  0.4367  0.4483  0.4556  0.4553  0.4582
0.30    0.3693  0.3725  0.3862  0.3964  0.3997  0.4049
0.40    0.3045  0.3095  0.3215  0.3331  0.3436  0.3462
0.50    0.2538  0.2560  0.2620  0.2716  0.2804  0.2865
0.60    0.2015  0.2022  0.2062  0.2185  0.2252  0.2308
0.70    0.1420  0.1428  0.1458  0.1524  0.1600  0.1653
0.80    0.0897  0.0908  0.0947  0.0981  0.1027  0.1047
0.90    0.0396  0.0395  0.0430  0.0438  0.0449  0.0473
1.00    0.0042  0.0042  0.0042  0.0052  0.0052  0.0046
map     0.2600  0.2632  0.2693  0.2754  0.2788  0.2834

Table D.1. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing Okapi scores
for the trec12 collection as a function of the number of regularized documents. Bold numbers
indicate statistically significant improvements in performance using the Wilcoxon test (p <
0.05); italicized numbers indicate statistically significant degradations in performance.


robust/okapi

recall       0     100     250     500     750    1000
0.00    0.7361  0.7311  0.7232  0.7175  0.7030  0.7050
0.10    0.5388  0.5441  0.5600  0.5606  0.5497  0.5550
0.20    0.4346  0.4512  0.4641  0.4589  0.4546  0.4600
0.30    0.3667  0.3734  0.3890  0.3874  0.3868  0.3887
0.40    0.2913  0.3012  0.3153  0.3170  0.3128  0.3181
0.50    0.2353  0.2507  0.2615  0.2669  0.2615  0.2727
0.60    0.1894  0.1987  0.2088  0.2162  0.2131  0.2226
0.70    0.1538  0.1563  0.1637  0.1692  0.1697  0.1713
0.80    0.1059  0.1057  0.1106  0.1167  0.1202  0.1185
0.90    0.0666  0.0683  0.0719  0.0759  0.0774  0.0772
1.00    0.0338  0.0351  0.0369  0.0379  0.0376  0.0374
map     0.2652  0.2713  0.2804  0.2827  0.2791  0.2826

Table D.2. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing Okapi scores
for the robust collection as a function of the number of regularized documents. Bold numbers
indicate statistically significant improvements in performance using the Wilcoxon test (p <
0.05); italicized numbers indicate statistically significant degradations in performance.

trec12/QL

recall       0     100     250     500     750    1000
0.00    0.7518  0.7480  0.7462  0.7501  0.7451  0.7330
0.10    0.4922  0.5033  0.5211  0.5272  0.5274  0.5244
0.20    0.4163  0.4266  0.4494  0.4593  0.4620  0.4626
0.30    0.3469  0.3545  0.3794  0.3909  0.3993  0.4014
0.40    0.2913  0.2968  0.3178  0.3303  0.3402  0.3460
0.50    0.2325  0.2387  0.2501  0.2658  0.2742  0.2818
0.60    0.1792  0.1827  0.1910  0.2048  0.2119  0.2142
0.70    0.1345  0.1353  0.1401  0.1490  0.1557  0.1566
0.80    0.0953  0.0958  0.0991  0.0990  0.1081  0.1104
0.90    0.0493  0.0490  0.0502  0.0494  0.0524  0.0532
1.00    0.0084  0.0084  0.0078  0.0060  0.0055  0.0057
map     0.2506  0.2554  0.2657  0.2722  0.2783  0.2800

Table D.3. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing QL scores as
a function of the number of regularized documents for the trec12 collection. Bold numbers
indicate statistically significant improvements in performance using the Wilcoxon test (p <
0.05); italicized numbers indicate statistically significant degradations in performance.


robust/QL

recall       0     100     250     500     750    1000
0.00    0.7523  0.7398  0.7402  0.7317  0.7270  0.7348
0.10    0.5420  0.5633  0.5652  0.5642  0.5631  0.5692
0.20    0.4375  0.4622  0.4713  0.4711  0.4715  0.4763
0.30    0.3605  0.3872  0.4028  0.4038  0.4091  0.4063
0.40    0.2843  0.3131  0.3281  0.3340  0.3377  0.3411
0.50    0.2356  0.2600  0.2741  0.2828  0.2883  0.2882
0.60    0.1880  0.2013  0.2165  0.2188  0.2243  0.2295
0.70    0.1477  0.1533  0.1657  0.1692  0.1739  0.1736
0.80    0.1040  0.1038  0.1124  0.1206  0.1197  0.1199
0.90    0.0696  0.0732  0.0763  0.0769  0.0786  0.0804
1.00    0.0398  0.0427  0.0417  0.0411  0.0403  0.0405
map     0.2649  0.2788  0.2885  0.2909  0.2933  0.2955

Table D.4. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing QL scores as
a function of the number of regularized documents for the robust collection. Bold numbers
indicate statistically significant improvements in performance using the Wilcoxon test (p <
0.05); italicized numbers indicate statistically significant degradations in performance.

trec12/RM

recall       0     100     250     500     750    1000
0.00    0.7766  0.7645  0.7754  0.7602  0.7673  0.7740
0.10    0.5489  0.5623  0.5609  0.5600  0.5613  0.5575
0.20    0.4882  0.4911  0.4919  0.4940  0.4971  0.4946
0.30    0.4350  0.4392  0.4411  0.4452  0.4449  0.4453
0.40    0.3797  0.3819  0.3894  0.3945  0.3947  0.3929
0.50    0.3210  0.3243  0.3329  0.3432  0.3437  0.3426
0.60    0.2666  0.2683  0.2736  0.2844  0.2873  0.2865
0.70    0.2017  0.2053  0.2089  0.2171  0.2187  0.2171
0.80    0.1424  0.1407  0.1432  0.1491  0.1543  0.1548
0.90    0.0865  0.0860  0.0871  0.0859  0.0877  0.0871
1.00    0.0098  0.0098  0.0097  0.0095  0.0094  0.0102
map     0.3154  0.3176  0.3203  0.3248  0.3257  0.3252

Table D.5. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing RM scores as
a function of the number of regularized documents for the trec12 collection. Bold numbers
indicate statistically significant improvements in performance using the Wilcoxon test (p <
0.05); italicized numbers indicate statistically significant degradations in performance.


robust/RM

recall       0     100     250     500     750    1000
0.00    0.6926  0.7005  0.6879  0.6909  0.6931  0.6904
0.10    0.5458  0.5593

Table D.6. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing RM scores as
a function of the number of regularized documents for the robust collection. Bold numbers
indicate statistically significant improvements in performance using the Wilcoxon test (p <
0.05); italicized numbers indicate statistically significant degradations in performance.

Table D.7. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing Markov
random field scores for the trec12 collection as a function of the number of regularized
documents. Bold numbers indicate statistically significant improvements in performance using
the Wilcoxon test (p < 0.05); italicized numbers indicate statistically significant degradations
in performance.

Table D.8. Average interpolated precision at standard recall points and mean average
non-interpolated precision. This table demonstrates performance of regularizing Markov
random field scores for the robust collection as a function of the number of regularized
documents. Bold numbers indicate statistically significant improvements in performance using
the Wilcoxon test (p < 0.05); italicized numbers indicate statistically significant degradations
in performance.